Development of a quantitative structure activity relations (QSAR) model to guide the design of fluorescent dyes for detecting amyloid fibrils

DI Inshyn1, VB Kovalska1, MY Losytskyy1, YL Slominskii2, OI Tolmachev2, SM Yarmoluk1

1Institute of Molecular Biology and Genetics, National Academy of Sciences of Ukraine, 03143 Kyiv, Ukraine, and 2Institute of Organic Chemistry, National Academy of Sciences of Ukraine, Murmans’ka Street 5, 02094 Kyiv, Ukraine

Biotech Histochem Downloaded from informahealthcare.com by Northeastern University on 01/13/14 For personal use only.

Accepted March 11, 2013

Abstract

Quantitative structure activity relationship (QSAR) studies were performed on a set of polymethine compounds to develop new fluorescent probes for detecting amyloid fibrils. Two approaches were evaluated for developing a predictive model: partial least squares (PLS) regression and an artificial neural network (ANN). A set of 60 relevant molecular descriptors was selected by performing principal component analysis on more than 1600 calculated molecular descriptors. Through QSAR analysis, two predictive models were developed. The final versions produced average prediction accuracies of 72.5% and 84.2% for the linear PLS and the non-linear ANN procedures, respectively. The ANN model was tested by using it to predict the activity, i.e., staining or non-staining of amyloid fibrils, of 320 compounds. The five candidates that the ANN model ranked as most active then were tested empirically to confirm their predicted properties. The results indicated that the ANN model is potentially useful for predicting the activity of untested compounds as dyes for detecting amyloid fibrils.

Key words: amyloid fibrils, artificial neural network, cyanine dyes, fluorescent probe design, QSAR

Correspondence: S.M. Yarmoluk, Institute of Molecular Biology and Genetics, National Academy of Sciences of Ukraine, 03143 Kyiv, Ukraine. E-mail: [email protected]
© 2013 The Biological Stain Commission
Biotechnic & Histochemistry 2014, 89(1): 1–7.
DOI:10.3109/10520295.2013.785593

Aggregation of proteins into insoluble amyloid fibrils (AF) plays a key role in the development of several neurodegenerative conditions including Alzheimer’s and Parkinson’s diseases. Although fluorescence based assays are widely used for searching for anti-amyloidogenic agents, only a limited number of dyes currently are used for such procedures. All of the main groups of compounds, i.e., those derived from thioflavin T (Thio T)-PIB (Pittsburgh Compound B; N-methyl-4’-methylaminophenyl-6-hydroxybenzothiazole) (Klunk et al. 2004), SB-13 (Verhoeff et al. 2004, Klunk et al. 2004), Congo red-BSB (bromo-2,5-bis-(3-hydroxycarbonyl-4-hydroxy)styrylbenzene) (Schmidt et al. 2001) and aminonaphthalene-FDDNP (2-(1-(2-(N-(2-[18F]fluoroethyl)-N-methylamino)naphthalene-6-yl)ethylidene)malononitrile) (Agdeppa et al. 2001), bind nonspecifically to AFs (Agdeppa et al. 2003, Ye et al. 2005). We previously examined a variety of substituted poly- and monomethine dyes as potential fluorescent probes for AF detection and identified several cyanines with high specificity and efficiency (Volkova et al. 2008).

Despite the many dyes proposed for AF detection, the modes of interaction of dyes with AF remain poorly understood. Consequently, we decided to apply the QSAR methodology to seek fluorescent probes for amyloid, even though the structure of the biological site is not well defined (Voropay et al. 2003, Cooper 1974). Partial least squares (PLS) and artificial neural network (ANN) regression methods often are used for performing QSAR analysis. PLS regression is a statistical tool that generates a linear regression model by projecting the predicted and observable variables onto a new set of coordinates (Geladi and Kowalski 1986, Trygg and Wold 2002). An ANN is a mathematical tool that can be used for regression and classification and was originally inspired by the neuron structure in the brain (Palyulin et al. 2000, Givehchi and Schneider 2004). An ANN consists of a series of nodes (analogous to neurons) that have multiple connections with other nodes. Principal component analysis (PCA) reduces the dimensionality of the data (Boyd and Seward 1991).

Earlier modeling of fluorescent probes using QSAR methodology was conducted to identify the common structural features of fluorescent dyes that are selective for endoplasmic reticulum. A classification model was developed that had good predictive characteristics based on physicochemical properties such as lipophilicity, amphipathic character, and size of the aromatic system (Colston et al. 2003). We recently performed a QSAR analysis to identify possible inhibitors of AF formation (Volkova et al. 2010), which involved optimization of the structures of inhibitors; this approach may be used for designing probes for AF. We used computational approaches including PLS and ANN procedures to build predictive models of the activity of mono- and polymethine cyanine dyes as probes for detecting AF.

Materials and methods

Reagents

Dimethyl sulfoxide (DMSO) and 10 mM Tris-HCl buffer, pH 7.8, were used as solvents. These reagents and Thio T were purchased from Sigma (St. Louis, MO).

Preparation of stock solutions of dyes and inhibitor molecules

Stock dye solutions were prepared by dissolving the dyes at 2 mM in DMSO (dye 7519) or in 10 mM Tris-HCl buffer (Thio T). Stock 5 × 10⁻⁴ M solutions of selected inhibitors were prepared by dissolving the compounds in DMSO.

Data sets

Fifty-two polymethine dyes from the in-stock compound library of the Institute of Molecular Biology and Genetics were studied. Another 92 polymethine dyes were provided by the Institute of Organic Chemistry, National Academy of Sciences of Ukraine. Together, these compounds provided a wide diversity of molecular structures. Fluorescence studies of these compounds were performed according to published methods (Volkova et al. 2008). The following emission intensities were measured for each dye: free dye (IF), dye in the presence of monomeric α-synuclein (IM), and dye in the presence of fibrillar α-synuclein (IA).

Preparing structures

Before calculating the molecular descriptors, geometrical optimization of each compound was performed using the MM+ force field (Hocquet and Langgard 1998) in HyperChem (Hyperchem Inc., http://www.hyper.com). All conformations obtained from the system conformation search led to a unique conformation with lowest free energy when optimized further using the Austin model 1 (AM1) method (Dewar et al. 1985). All compounds were considered to be only trans isomers.

Molecular descriptors calculation

Molecular descriptors were calculated using the E-Dragon software (Tetko et al. 2005). This program can calculate more than 1600 descriptors in six categories: constitutional descriptors, electronic descriptors, physicochemical properties, topological indices, geometrical molecular descriptors, and quantum chemistry descriptors.


Data pre-processing

All data were subjected to pre-processing before further work was carried out. First, any descriptor that had an identical value for more than 90% of the samples was removed. Second, any descriptor with a relative standard deviation < 0.05 was removed. Finally, one of any two descriptors with an absolute Pearson correlation coefficient above 0.9 was removed.

Machine learning approaches

Both PLS and ANN approaches were performed using the Virtual Computational Chemistry Laboratory on-line package (VCCLAB, http://www.vcclab.org). PLS analysis was carried out using PLSR software (Palyulin et al. 2000). ANN regression was calculated using the ASNN tool (Tetko 2002). Gradient descent with momentum was used for training. Network topology, specifically the number and type of neuronal layers, was chosen by maximizing the generalization ability for the test set.
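The three pre-processing filters described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure the authors ran: the function and variable names are hypothetical, the relative standard deviation is taken as the sample standard deviation divided by the absolute mean, and for each highly correlated pair the descriptor that appears later is the one dropped (the text does not say which member of a pair was removed).

```python
import math
import random
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def relative_std(values):
    """Sample standard deviation divided by |mean| (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    if mean == 0:
        return math.inf  # assumption: zero-mean descriptors are not filtered here
    return math.sqrt(var) / abs(mean)


def preprocess_descriptors(columns, same_value_frac=0.9, min_rsd=0.05,
                           max_abs_corr=0.9):
    """columns maps descriptor name -> list of values; returns kept names."""
    kept = []
    for name, values in columns.items():
        # Filter 1: identical value for more than 90% of the samples.
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count / len(values) > same_value_frac:
            continue
        # Filter 2: relative standard deviation below 0.05.
        if relative_std(values) < min_rsd:
            continue
        # Filter 3: |Pearson r| above 0.9 with a descriptor already kept.
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue
        kept.append(name)
    return kept


# Synthetic demonstration data (not from the paper).
rng = random.Random(1)
a = [rng.gauss(10.0, 3.0) for _ in range(50)]
b = [rng.gauss(5.0, 2.0) for _ in range(50)]
demo = {
    "informative_a": a,
    "constant": [1.0] * 50,                                    # fails filter 1
    "low_spread": [100.0 + rng.gauss(0.0, 0.01) for _ in range(50)],  # fails filter 2
    "copy_of_a": [2.0 * v + 1.0 for v in a],                   # fails filter 3
    "informative_b": b,
}
kept = preprocess_descriptors(demo)
print(kept)
```

Only descriptors that pass the first two filters enter the `kept` list, so checking each candidate against `kept` gives the same result as running the three filters one after another.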


Model validation

We used five-fold cross-validation to assess the validity of the model. The model construction set was split randomly into five subsets, and the following procedure was repeated for each subset: one subset was used as the test set and the other four were combined to form the training set. The overall accuracy, averaged across all five test sets, then was taken as the generalization ability.

Performance measures

The performance of the classifiers was measured by sensitivity (SE), specificity (SP) and overall prediction accuracy (Q), defined as:

SE = TP/(TP + FN)
SP = TN/(TN + FP)
Q = (TP + TN)/(TP + FN + TN + FP)

where TP, FN, TN and FP are the numbers of true positives, false negatives, true negatives and false positives, respectively. SE and SP are the prediction accuracies for positive and negative samples, respectively (Li et al. 2005).
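As a worked illustration of the definitions above, the following Python sketch computes SE, SP and Q from true and predicted class labels and generates a random five-fold split. The function names, the seed and the toy labels are hypothetical; the paper's analysis was run with the VCCLAB tools, not with this code.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (TP, FN, TN, FP) for boolean-like label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp, fn, tn, fp


def performance(y_true, y_pred):
    """Sensitivity, specificity and overall accuracy as defined in the text.

    Assumes at least one positive and one negative sample; otherwise the
    corresponding ratio is undefined and a ZeroDivisionError is raised.
    """
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    q = (tp + tn) / (tp + fn + tn + fp)
    return se, sp, q


def five_fold_indices(n_samples, seed=0):
    """Yield (train, test) index lists for a random five-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for k in range(5):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Toy labels (1 = active, 0 = inactive); not data from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
se, sp, q = performance(y_true, y_pred)
print(f"SE = {se:.2f}, SP = {sp:.2f}, Q = {q:.2f}")

# 144 = 52 + 92 dyes in the data set described above.
folds = list(five_fold_indices(144))
```

In a full cross-validation run, a model would be trained on each `train` list, `performance` would be evaluated on the matching `test` list, and the five Q values would be averaged.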

Table 1. Performance of PLS regression at each step of feature selection

Model number    Number of descriptors    TP    r²
1               89                       47    0.74
2               91                       45    0.69
3               94                       47    0.72
4               93                       43    0.70
5               85                       42    0.72

TP, number of true positive observations.

Results and discussion

We calculated 1678 descriptors, which fell into six categories: constitutional descriptors, electronic descriptors, physicochemical properties, topological indices, geometrical molecular descriptors, and quantum chemistry descriptors. After the pre-processing step, the number of descriptors decreased to 636. After performing the PLS algorithm, 162 descriptors with regression coefficients above 0.6 were selected. The compounds in the data set with fluorescence data were divided into two activity classes, i.e., active (A+) and inactive (A−). This classification of compounds was based on the following fluorescence characteristics: the emission intensities of free dye (IF), dye in the presence of monomeric α-synuclein

Table 2. Symbols and definitions of the most important descriptors in the ANN model

Name            Ratio   Rank   Description
ovt             1.01    37     ovality
chi1v_C         1.01    38     carbon valence connectivity index
KierA1          1.01    39     first alpha modified shape index
KierA2          1.01    40     second alpha modified shape index
chi0v_C         1.01    41     carbon valence connectivity index
chi1            1.01    42     atomic connectivity index
pmi             1.01    43     principal moment of inertia
TPSA            1.00    44     polar surface area
rgyr            1.00    45     radius of gyration
Kier1           1.00    46     first kappa shape index
E_strain        1.00    47     local strain energy
vsurf_EWmin1    1.00    48     lowest hydrophilic energy
ASA+            1.00    49     water accessible surface area with positive partial charge
E_rvdw          1.00    50     van der Waals interaction energy
PEOE_PC-        1.00    51     total negative partial charge
b_1rotR         1.00    52     fraction of rotatable single bonds
SMR             1.00    53     molecular refractivity
dipoleZ         1.00    54     the z component of the dipole moment
vdw_area        1.00    55     area of van der Waals surface
b_rotN          1.00    56     number of rotatable bonds
vsurf_G         1.00    57     surface globularity
chi1v           0.99    58     atomic valence connectivity index (order 1)
vsurf_S         0.99    59     interaction field surface area
Q_VSA_FPPOS     0.98    60     fractional positive polar van der Waals surface area


Table 3. Performance of the ANN regression at each step of feature selection

Model number    Number of descriptors    TP    r²
1               64                       44    0.84
2               60                       46    0.89
3               61                       45    0.86
4               62                       47    0.84
5               64                       43    0.82

TP, number of true positive observations.

(IM), and dye in the presence of fibrillar α-synuclein (IA). Dyes were classed as A+ if the ratio of IA to IF (IA/IF) was > 20 and the ratio of IA to IM (IA/IM) was > 5. Compounds with lower IA/IF and/or IA/IM values were classed as inactive (A−). In this way, we identified 45 A+ compounds and 99 A− compounds in the data set.

Attempts to perform a regression analysis directly resulted in models with rather poor predictive properties: r² = 0.63 for the PLS model and r² = 0.56 for the ANN model. Consequently, we decided to perform the analysis in two sequential phases: qualitative and quantitative prediction. The qualitative (classification) analysis predicted the activity class, and the subsequent quantitative analysis predicted the IA/IF value. As a result of classification analysis, we developed a model with r² = 0.84 using ANN. Our current regression model validated 38 A+ dyes from an initial 45 A+ dyes. Based on the data, a further regression analysis was carried out for A+ dyes.

Based on the results of the regression analysis using the PLS algorithm, we selected five models with an average prediction accuracy of r² > 0.71 (Table 1). The moderate prediction accuracy can be attributed to limitations that arose from the feature selection step when using the PLS method, because it is a linear regression method. The major limitations were a higher risk of overlooking real correlations and sensitivity to the relative scaling of the descriptor variables. Nevertheless, a clear trend of increased prediction ability of the model was observed when the number of descriptors was reduced, owing to reduced ambiguity in descriptor choice. To test whether the selected 162 descriptors truly were relevant to A+ compounds, the selected descriptors were used directly in model building by cross-validation using the ANN approach (Table 2).
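The activity classes used throughout this section follow from the two intensity ratios defined earlier (IA/IF > 20 together with IA/IM > 5). A minimal Python sketch of that labeling rule is shown below; the function name, argument names and the example intensities are hypothetical and are not taken from the data set.

```python
def classify_dye(i_free, i_monomer, i_fibril,
                 min_fibril_to_free=20.0, min_fibril_to_monomer=5.0):
    """Label a dye 'A+' (active) or 'A-' (inactive) from emission intensities.

    i_free    -- emission intensity of free dye (IF)
    i_monomer -- intensity with monomeric alpha-synuclein (IM)
    i_fibril  -- intensity with fibrillar alpha-synuclein (IA)
    """
    if i_free <= 0 or i_monomer <= 0:
        raise ValueError("IF and IM must be positive to form the ratios")
    ia_if = i_fibril / i_free
    ia_im = i_fibril / i_monomer
    # Both thresholds must be exceeded for a dye to count as active.
    if ia_if > min_fibril_to_free and ia_im > min_fibril_to_monomer:
        return "A+"
    return "A-"


# Hypothetical (IF, IM, IA) triples for illustration only.
demo_rows = [
    (1.0, 1.5, 30.0),   # IA/IF = 30, IA/IM = 20: both thresholds passed
    (2.0, 2.0, 30.0),   # IA/IF = 15: fails the first threshold
    (1.0, 10.0, 30.0),  # IA/IM = 3: fails the second threshold
]
labels = [classify_dye(*row) for row in demo_rows]
print(labels)
```

Because both conditions are required, a dye with a large IA/IF ratio is still labeled inactive when its signal with monomeric protein is comparably high.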
The model parameter, i.e., the number of neurons in the hidden layer of the ANN, was optimized using a systematic search. Table 3 gives the


Fig. 1. Predicted IA/IF vs. experimental values for 47 A+ dyes.

prediction accuracy for each test set in the cross validation. The averaged root mean square of prediction error is 0.85 for A+ compounds. The linear regression model was statistically valid and the PLS routine enabled an investigation of the effects of each descriptor in the model. These results indicate that the prediction accuracy level for the

Table 4. Fluorescence data for 24 A+ dyes

Number of dye    IF      IM      IA      IA/IF experimental    IA/IF predicted
1                1.8     2.0     65.0    37.1                  46.7
2                2.1     2.2     13.8    6.5                   28.8
3                4.2     3.5     24.8    5.9                   12.7
4                2.2     2.1     2.2     1.0                   0.6
5                1.7     2.1     2.1     1.2                   1.3
6                5.6     4.3     45.0    8.0                   5.6
7                0.8     1.0     11.5    14.4                  12.3
8                32.0    26.1    32.0    1.0                   1.1
9                1.5     1.5     9.5     6.3                   10.9
10               0.6     0.8     12.1    20.2                  13.0
11               1.8     2.8     4.3     2.4                   5.4
12               1.1     1.5     12.5    10.9                  9.3
13               5.0     4.6     26.0    5.2                   6.7
14               1.6     1.3     12.0    7.5                   13.1
15               1.4     0.9     4.0     2.9                   2.5
16               0.6     0.7     4.6     7.6                   11.6
17               1.2     0.8     9.0     7.5                   9.0
18               0.1     0.3     4.7     33.6                  31.0
19               4.8     4.5     9.0     1.9                   2.0
20               13.0    8.0     28.0    2.2                   3.8
21               1.0     0.3     14.0    14.0                  7.8
22               1.2     2.5     5.5     4.8                   3.3
23               38.3    28.0    44.0    1.2                   10.0
24               0.1     1.4     0.3     2.8                   1.5

IF, emission intensity of free dye; IM, emission intensity of dye in the presence of monomeric α-synuclein; IA, emission intensity of dye in the presence of fibrillar α-synuclein.
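One way a reader might compare the experimental and predicted IA/IF columns of Table 4 is sketched below in Python, using the Pearson correlation coefficient and the root mean square error. The helper names are hypothetical, and this is not necessarily how the statistics reported in the paper were obtained, so the numbers it prints need not match them.

```python
import math

# IA/IF values transcribed from Table 4 (dyes 1 to 24).
experimental = [37.1, 6.5, 5.9, 1.0, 1.2, 8.0, 14.4, 1.0, 6.3, 20.2, 2.4, 10.9,
                5.2, 7.5, 2.9, 7.6, 7.5, 33.6, 1.9, 2.2, 14.0, 4.8, 1.2, 2.8]
predicted = [46.7, 28.8, 12.7, 0.6, 1.3, 5.6, 12.3, 1.1, 10.9, 13.0, 5.4, 9.3,
             6.7, 13.1, 2.5, 11.6, 9.0, 31.0, 2.0, 3.8, 7.8, 3.3, 10.0, 1.5]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root mean square difference between paired values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


r = pearson_r(experimental, predicted)
error = rmse(experimental, predicted)
print(f"Pearson r = {r:.2f}, r^2 = {r * r:.2f}, RMSE = {error:.2f}")
```

The squared Pearson coefficient equals the r² of a simple least-squares line through the points, which is one common reading of the r² values quoted in this kind of predicted-versus-experimental plot.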


Fig. 2. Predicted IA/IF vs. experimental values for 24 A+ dyes.

ANN model is greater than for the PLS model. The accuracy of these machine learning approaches may be improved further by using new informative descriptors as inputs to the ANN and by refining the network topology.

In general, the mechanism of probe–protein binding can be regarded as the noncovalent interaction of planar, oblong dye molecules with channels of fibrils that are formed by the beta pleated sheet of the ligand. Consequently, topological descriptors that reflect linear dimensions (and other geometrical properties) of the dye molecule and descriptors of steric accessibility should be most predictive of activity. This hypothesis is consistent with the fact that such properties correspond to the neural network inputs with the largest weights (Table 2). The most important descriptors with regression coefficients are given here in order of their increasing effect on activity. The QSAR model developed by the ANN application has good regression properties (r² = 0.86) for predicting an IA/IF value for A+ compounds. Values for IA/IF extracted from ANN model 4 agree with the experimental data as shown in Fig. 1. These results indicate that the ANN QSAR model can be used to assist the design of novel imaging agents for AF.

The validity of the predictive model was investigated by applying the procedure described above to 320 additional dyes. This resulted in the identification of 24 dyes classified as A+. Quantitative analysis aimed at predicting IA/IF characteristics then was carried out on these dyes. Twenty-four substances with IA/IF > 1 were tested and the results are shown in Table 4. The correlation

Fig. 3. Probes for AF detection identified by the ANN method.


between experimental and predicted IA/IF values for these A+ compounds is shown in Fig. 2. In this way, five cyanine dyes with properties suitable for detecting AF were identified; their structures are given in Fig. 3.

We employed the QSAR methodology for the first time to develop fluorescent probes for AF detection using imaging methods. We developed a predictive model applicable to mono- and polymethine dyes. Both PLS and ANN algorithms were used and the latter produced more accurate predictions. To confirm the validity of the ANN model, it was applied to an additional group of mono- and polymethine dyes. As a result, five cyanine dyes were identified with properties suitable for AF detection. Consequently, we believe that the ANN model is a useful method for predicting the efficacy of a dye as a fluorescent probe for AF.

Acknowledgment

This work was supported by the Science and Technology Center in Ukraine, Project 5281 “Development of fluorescent dyes for detection of oligomeric amyloid intermediates in neurodegenerative diseases.”

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

References

Agdeppa ED, Kepe V, Liu J, Flores-Torres S, Satyamurthy N, Petric A, Cole GM, Small GW, Huang SC, Barrio JR (2001) Binding characteristics of radiofluorinated 6-dialkylamino-2-naphthylethylidene derivatives as PET imaging probes of β-amyloid plaques in Alzheimer’s disease. J. Neurosci. 21: 181–185.

Agdeppa ED, Kepe V, Liu J, Small GW, Huang SC, Satyamurthy N, Petric A, Barrio JR (2003) 2-Dialkylamino-6-acylmalononitrile substituted naphthalenes (DDNP analogs): novel diagnostic and therapeutic tools in Alzheimer’s disease. Mol. Imag. Biol. 5: 407–417.

Boyd RD, Seward CM (1991) The substituent parameter database: a powerful tool for QSAR analysis. In: QSAR: Rational Approaches to the Design of Bioactive Compounds. Pharmacochemistry Library Series, Elsevier, Amsterdam, vol. 16. pp. 167–170.

Colston J, Horobin RW, Rashid-Doubell F, Pediani J, Johal KK (2003) Why fluorescent probes for endoplasmic reticulum are selective: an experimental and QSAR-modeling study. Biotech. & Histochem. 78: 323–332.

Cooper JH (1974) Selective amyloid staining as a function of amyloid composition and structure. Lab. Invest. 31: 232–238.

Dewar MS, Zoebisch EG, Healy EF (1985) Development and use of quantum mechanical molecular models. 76. AM1: a new general purpose quantum mechanical molecular model. J. Am. Chem. Soc. 107: 3902–3909.

Geladi P, Kowalski B (1986) Partial least squares regression: a tutorial. Anal. Chim. Acta 185: 1–17.

Givehchi A, Schneider G (2004) Impact of descriptor vector scaling on the classification of drugs and nondrugs with artificial neural networks. J. Mol. Model. 10: 204–211.

Hocquet A, Langgard M (1998) An evaluation of the MM+ force field. Molec. Modeling Ann. 4: 94–112.

Klunk WE, Engler H, Nordberg A, Wang YM, Blomqvist G, Holt DP, Bergstrom M, Savitcheva I, Huang GF, Estrada S, Ausen B, Debnath ML, Barletta J, Price JC, Sandell J, Lopresti BJ, Wall A, Koivisto P, Antoni G, Mathis CA, Langstrom B (2004) Imaging brain amyloid in Alzheimer’s disease with Pittsburgh Compound-B. Ann. Neurol. 55: 306–319.

Klunk WE, Wang YM, Mathis CA (2004) Amyloid deposits in transgenic PS1/APP mice do not bind the amyloid PET tracer, PIB, in the same manner as human brain amyloid. Neurobiol. Aging 25: 232–233.

Kung MP, Hou C, Zhuang ZP, Skovronsky D, Kung HF (2004) Binding of two potential imaging agents targeting amyloid plaques in postmortem brain tissues of patients with Alzheimer’s disease. Brain Res. 1025: 98–105.

Li H, Ung CY, Yap CW, Xue Y, Li ZR, Cao ZW, Chen YZ (2005) Prediction of genotoxicity of chemical compounds by statistical learning methods. Chem. Res. Toxicol. 18: 1071–1080.

Palyulin VA, Radchenko EV, Zefirov NS (2000) Molecular field topology analysis method in QSAR studies of organic compounds. J. Chem. Inf. Comp. Sci. 40: 659–667.

Schmidt ML, Schuck T, Sheridan S, Kung MP, Kung H, Zhuang ZP, Bergeron C, Lamarche JS, Skovronsky D, Giasson BI, Trojanowski JQ (2001) The fluorescent Congo red derivative, (trans, trans)-1-bromo-2,5-bis-(3-hydroxycarbonyl-4-hydroxy)styrylbenzene (BSB), labels diverse beta-pleated sheet structures in postmortem human neurodegenerative disease brains. Am. J. Pathol. 159: 937–943.

Tetko IV (2002) Neural network studies. 4. Introduction to associative neural networks. J. Chem. Inf. Comput. Sci. 42: 717–728.

Tetko IV, Gasteiger J, Todeschini R, Mauri A, Livingstone D, Ertl P, Palyulin VA, Radchenko EV, Zefirov NS, Makarenko AS, Tanchuk VY, Prokopenko VV (2005) Virtual computational chemistry laboratory-design and description. J. Comput. Aid. Mol. Des. 19: 453–463.

Trygg J, Wold S (2002) Orthogonal projections to latent structures. J. Chemomet. 16: 119–128.

Verhoeff NP, Wilson AA, Takeshita S, Trop L, Hussey D, Singh K, Kung HF, Kung MP, Houle S (2004) In vivo imaging of Alzheimer disease beta-amyloid with [11C]SB-13 PET. Am. J. Geriat. Psych. 12: 584–595.

Volkova KD, Kovalska VB, Balanda AO, Losytskyy MY, Golub AG, Vermeij RJ, Subramaniam V, Tolmachev OI, Yarmoluk SM (2008) Specific fluorescent detection of fibrillar α-synuclein using mono- and trimethine cyanine dyes. Bioorg. Med. Chem. 16: 1452–1459.

Volkova KD, Kovalska VB, Inshyn DI, Slominskii YL, Tolmachev OI, Yarmoluk SM (2010) Novel fluorescent trimethine cyanine dye 7519 for amyloid fibril inhibition assay. Biotech. & Histochem. 86: 188–191.

Voropay ES, Samtsov MP, Kaplevsky KN, Maskevich AA, Stepuro VI, Povarova OI, Kuznetsova IM, Turoverov KK, Fink AL, Uversky VN (2003) Spectral properties of thioflavine T and its complexes with amyloid fibrils. J. Appl. Spectrosc. 70: 868–874.

VCCLAB, Virtual Computational Chemistry Laboratory. http://www.vcclab.org.

Ye L, Morgenstern JL, Gee AD, Hong GZ, Brown J, Lockhart A (2005) Evidence for the presence of three distinct binding sites for the thioflavin T class of Alzheimer’s disease PET imaging agents on beta-amyloid peptide fibrils. J. Biol. Chem. 280: 7677–7684.
