
SOFTWARE NEWS AND UPDATES

WWW.C-CHEM.ORG

Development and Implementation of (Q)SAR Modeling Within the CHARMMing Web-User Interface

Iwona E. Weidlich,*[a,b] Yuri Pevzner,[c] Benjamin T. Miller,[b] Igor V. Filippov,[d] H. Lee Woodcock,[c] and Bernard R. Brooks[b]

The recent availability of large publicly accessible databases of chemical compounds and their biological activities (PubChem, ChEMBL) has inspired us to develop a web-based tool for structure activity relationship and quantitative structure activity relationship modeling to add to the services provided by CHARMMing (www.charmming.org). This new module implements some of the most recent advances in modern machine learning algorithms, including Random Forest, Support Vector Machine, Stochastic Gradient Descent, and Gradient Tree Boosting. A user can import training data from PubChem BioAssay data collections directly from our interface or upload his or her own SD files, which contain structures and activity information, to create new models (either categorical or numerical). A user can then track the model generation process and run models on new data to predict activity. © 2014 Wiley Periodicals, Inc.

Journal of Computational Chemistry 2015, 36, 62–67

DOI: 10.1002/jcc.23765

[a] I. E. Weidlich, Computational Drug Design Systems (CODDES) LLC, Rockville, Maryland 20852. E-mail: [email protected]
[b] I. E. Weidlich, B. T. Miller, B. R. Brooks, Laboratory of Computational Biology, NIH, National Heart, Lung, and Blood Institute, Rockville, Maryland 20852
[c] Y. Pevzner, H. Lee Woodcock, Department of Chemistry, University of South Florida, Tampa, Florida 33620
[d] I. V. Filippov, VIF Innovations, LLC, Rockville, Maryland 20852

Author Contributions: The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Notes: CHARMMing greatly values the privacy of users submitting their data. This information will be kept confidential and secure, abiding by US governmental IT resource policies. The scripts used in our framework are available for download at: https://charmming.googlecode.com/svn/branches/qsar2/qsar.

Contract grant sponsor: Intramural Research Program of the National Heart, Lung and Blood Institute of the National Institutes of Health; Contract grant sponsor: NIH; Contract grant number: 1K22HL08834101A1; Contract grant sponsor: University of South Florida

Introduction

CHARMMing[1] (http://charmming.org) is a widely used web-based front end to the Chemistry at HARvard Macromolecular Mechanics (CHARMM) molecular simulation package[2] as well as to other molecular modeling and drug discovery software. The code is in the public domain and may be downloaded, installed, and used without any restrictions. CHARMMing’s main functionality focuses on common molecular simulation tasks such as structure preparation, energy minimization (geometry optimization), solvation, molecular dynamics, and normal mode analysis. CHARMMing also has robust and interactive visualization capabilities, provided by the GLmol[3] and JSmol[4] software packages. One of the features that differentiates CHARMMing from other tools is its support for multiscale calculations using both classical and quantum methods; a graphical setup utility for multiscale calculations is available. A recent focus of CHARMMing development has been its extension as an academic drug discovery platform; the current manuscript marks an important step in this direction.

Drug discovery and development are broad fields that blend computer science, biology, and chemistry with the goal of identifying correlations between chemical structures and biological effects. A major part of computational design and discovery efforts is focused on developing structure activity relationship (SAR)[5] and quantitative structure activity relationship (QSAR) models.[6] SAR and QSAR models are used to predict the biological activity of a molecule based on its chemical structure. This area of computational chemistry has traditionally been used as a lead optimization approach in drug discovery research to improve certain features such as probability of activity, drug likeness, and absorption, distribution, metabolism, and excretion properties.[7–9] Specifically, SAR and QSAR allow compounds that lack desirable or possess undesirable characteristics to be excluded from larger sets before investing time in preparing, analyzing, and testing them. Thus, (Q)SAR procedures reduce costs of the early drug discovery pipeline[10–12] and have a long history of use both for the industrial design and regulatory assessment of pharmaceuticals, pesticides, and other chemicals.[13,14] Further, (Q)SAR models can also be used to reduce the number of animal tests, and thus are applied in industry, where “nonanimal alternatives” are being actively sought.[15,16] In recent years, (Q)SAR modeling has found broader application in virtual screening as well as in the area of chemical risk assessment.[17] However, the quality of (Q)SAR predictions currently lags behind that of other areas of machine learning (ML) applications, and thus there is room for improvement within this field of cheminformatics.[18] Two main routes to improvement are higher quality experimental data and more robust data mining approaches.[19]

We have previously developed SAR models using two advanced ML classifiers: random forest (RF)[20,21] and k Nearest Neighbor Simulated Annealing.[22] We identified novel nonnucleoside chemical motifs and Candesartan cilexetil (a drug used to treat hypertension and heart failure)[23] as active against NS5B, the hepatitis C virus RNA polymerase. Herein, we describe the development of a general (Q)SAR infrastructure based around the CHARMM web-user interface, charmming.org, to be used to build these types of models. A web-based graphical user interface (GUI) service to develop similar SAR models has been implemented and integrated with CHARMMing’s tools for uploading structures, performing simulations, and viewing the results. Further, well-known ML techniques such as cross validation[24] and y-randomization[25] have been implemented so users can immediately see whether the created model is able to produce valid predictions. This is an important and often overlooked step in (Q)SAR modeling.

Other services also offer web-based (Q)SAR model creation, in particular ChemBench[26] and OChEM.[27] ChemBench offers a convenient user interface and a way to set up cross validation, as well as track submitted jobs. However, it has only four ML methods (RF,[20,21] support vector machine,[28] kNN-SA,[22] and kNN-GA[29]). It also seems to be quite slow (a submitted job can take hours and even days to complete) and requires some data preparation before use: the activities and structures have to be uploaded as separate files. Conversely, the charmming.org QSAR functionality is very simple to use and comparatively much faster. A structure data (SD) file with activity can be directly uploaded with little to no preparation necessary. The latest version of the tool also allows transparent import of training data from the PubChem BioAssay database. It offers 15 different algorithms, described in Table 1. OChEM is another web-based tool. Its selection of models is also more limited than what we offer. The tool only accepts Excel spreadsheets as input; SD files cannot be uploaded. It should also be noted that, in contrast to such tools as KNIME,[31] Orange,[32] Weka,[33] and Accelrys Pipeline Pilot,[34] our service does not require installation of any software or plugins, or workflow programming.

Table 1. Description of algorithms used in the CHARMMing QSAR module.[30]

Algorithms | C[a] | R[b] | Description
Random forest | x | x | RF is a powerful ensemble method. Each individual classifier is a decision tree acting on a randomly selected subset of features. The final result is obtained by combining the classifiers by averaging their probabilistic prediction.
SVM | x | x | Support vector machines are a set of effective nonlinear regression and classification methods in high dimensional spaces.
Decision tree | x | x | DT is a nonparametric supervised learning method which attempts to fit the data with a set of if-then-else decision rules.
Logistic classification | x | | Logistic classification is a linear model which minimizes a “hit or miss” cost function. It is also known as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier.
Stochastic gradient descent | x | x | SGD is an optimization approach used with linear classifiers under convex loss functions.
Naive Bayes | x | | NB is based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features.
Gradient tree boosting | x | x | GTB is an ensemble method which is a generalization of boosting to arbitrary differentiable loss functions. It is based on decision tree weak classifiers, the same as the Random Forest method.
Ridge regression | | x | RR is a generalization of the linear least squares approach. It imposes a penalty on the sum of the squares of the coefficients.
Lasso regression | | x | LR is similar to RR but the penalty is calculated from the L1 norm of the vector of coefficients.
Elastic Net | | x | EN is a combination of the LR and RR methods with a trade-off between the L1 and L2 norms of the vector of coefficients.

[a] Classification. [b] Regression.
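The three penalized regression entries of Table 1 (ridge, lasso, and Elastic Net) differ only in the penalty added to the least squares term. As an illustration, they can be written in the scaling used by the scikit-learn user guide; the symbols are our own notation rather than the paper's: X is the descriptor matrix for n molecules, y the vector of measured activities, w the coefficient vector, α ≥ 0 the penalty strength, and ρ ∈ [0, 1] the L1/L2 mixing ratio. Other texts scale the terms differently.

```latex
\begin{align}
\text{Ridge:} \quad & \min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w} \; \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_{w} \; \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{align}
```

Setting ρ = 1 in the Elastic Net objective recovers the lasso, while ρ = 0 leaves only a squared L2 penalty, as in ridge regression.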

Materials and Methods

To create a predictive (Q)SAR model, the module uses an in-house Python script based on the RDKit[35] RF ML[20,21] and chemistry modules (ML and Chem) and the scikit-learn library of ML tools.[36] RDKit is an open source toolkit for cheminformatics and ML written in C++ and Python.[37] This (Q)SAR module uses 2048-bit Morgan fingerprints[38,39] of radius 2 as features. A Morgan fingerprint is a bit vector in which the bits are set according to the molecular neighborhoods of atoms. It is similar to the Extended Connectivity fingerprints used in Accelrys Pipeline Pilot[40] and the Multilevel Neighborhoods of Atoms (MNA) descriptors used in Prediction of Activity Spectra for Substances (PASS).[41] Morgan fingerprints have previously been shown to be a powerful tool for QSAR modeling.[38]

CHARMMing (Q)SAR infrastructure

The primary aims of CHARMMing are twofold: first, to automate routine and complex tasks often performed by CHARMM users and second, to provide a graphical, easy to use tool that newcomers can use to learn about various topics relating to CHARMM usage and molecular simulation in general. CHARMMing is written in Python using the Django Web framework.[1] Individual segments of functionality are implemented as Django applications with considerable capability for adding new features. The (Q)SAR tool sets up jobs via a menu driven web interface (WI). The WI was designed to be easy to use yet powerful, with the computationally intensive part of jobs submitted to and run on dedicated backend resources (e.g., Beowulf clusters like LoBoS[42]). To accomplish this, support for CHARMMing’s scheduler daemon was added to the (Q)SAR module; that is, time consuming jobs are submitted as one or more portable batch system (PBS) (or Torque)[43] jobs. Whether running prediction jobs on an existing model takes less time per structure than training a new model depends strongly on the size of the data sets used for both calculations. Generally, the files for predictions are (and should be) much larger than the training data. It is rare to have training data for more than 10,000 molecules, but it is quite common to run predictions for millions of molecules. The prediction time can therefore be significant, so both training and prediction jobs are submitted to the queuing system. All jobs are saved in the database, and the outputs (i.e., predicted results) are saved in the user’s history and may be accessed at any time in the future. Users can keep track of the status (e.g., running, complete) of their jobs through the interface.
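To make the hand-off to the queuing system more concrete, the following is a minimal sketch of how a training or prediction command could be wrapped in a PBS/Torque job script. It is not CHARMMing's scheduler code: the function name `render_pbs_script`, the job name, and the `train_model.py` command line are hypothetical placeholders, and only standard `#PBS` directives are used.

```python
def render_pbs_script(job_name, command, walltime="01:00:00"):
    """Return the text of a minimal PBS/Torque job script that runs one command."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",           # job name shown by qstat
        "#PBS -l nodes=1:ppn=1",         # one core on one node
        f"#PBS -l walltime={walltime}",  # wall-clock limit, HH:MM:SS
        "#PBS -j oe",                    # merge stderr into stdout
        'cd "$PBS_O_WORKDIR"',           # start in the submission directory
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical command line; the actual CHARMMing scripts and options are not shown here.
script = render_pbs_script(
    "qsar_train_42",
    "python train_model.py training_set.sdf model.pkl",
)
print(script)
```

The resulting text would typically be written to a file and handed to `qsub`; a prediction job over millions of structures could be given a longer `walltime` than a training job.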

Framework and Discussion

To use the (Q)SAR service through CHARMMing, the user must first prepare and upload the dataset, then train the model, and finally use the model for further predictions. The use of the SAR and QSAR application presented here is described in more detail in tutorials located at www.charmmtutorial.org.[1,44–46] These tutorials provide a guided example with a detailed explanation of each step of the process and a known correct answer; they currently use the same data set that was used for the study of NS5B.[23]

Preparing and uploading the data

There are currently two ways to input training data. The first is by uploading a user’s own SD file with structures and activities. The user should prepare the original data with activity expressed as a binary value (e.g., Active/Inactive, Y/N, or any other pair of values) for SAR or as a numerical value (e.g., IC50; all values should be expressed in the same units: nM, mM, etc.) for QSAR. The second is to use the PubChem BioAssay search interface to query the PubChem database for relevant biological assay data via the PUG REST and SOAP interfaces.[47–52] A simple search box is presented; the result of a successful search is a list of assays with Assay ID (AID) identifiers, short descriptions, and the number of compounds tested in each assay. The AID is a clickable hyperlink that leads to the corresponding assay web page at PubChem, where a user can find more complete information about the assay. By selecting the checkboxes of the assays of interest, a user can either download the resulting combined dataset as an SD file or transparently submit it to the SAR model training procedure. There is an optional filter to remove molecules with unspecified or inconclusive assay results.

One of the factors limiting the reliability of (Q)SAR predictions is the applicability domain.[53] When building (Q)SAR models within the CHARMMing web-user interface, the user is free to judge, offline, the similarity between the molecules in the training set and the molecules for which he or she would like to make predictions. We do not calculate the applicability domain directly, but it is possible to visualize the mutual distribution and the similarity/diversity of the training set molecules versus any other set of molecules using tools such as Diversity Genie.[54] It can display both sets of molecules on a Sammon’s map, a two-dimensional projection that aims to preserve the intermolecular distances. It is recommended that the user estimate the diversity of the whole training set, as well as of the active and inactive subsets separately, before uploading. Two of the authors recently developed the chemical diversity analysis tool mentioned above;[54] alternative approaches have also been studied earlier.[55–59]

Training procedure
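The input to training is an SD file in which every record carries its activity as a data item. As a minimal sketch of what that means in practice (this is not the CHARMMing parser), the records of such a file could be read with the standard library alone. The tag name `Activity`, the helper names, and the one-atom toy records are assumptions made for illustration; a real upload may use a different tag and would contain full structures.

```python
def read_sd_activities(sd_text, tag="Activity"):
    """Return (molecule_name, activity_value) pairs from SD-format text."""
    pairs = []
    for record in sd_text.split("$$$$"):      # "$$$$" terminates each SD record
        lines = record.strip("\n").splitlines()
        if not lines:
            continue
        name = lines[0].strip()                # first molblock line holds the name
        value = None
        for i, line in enumerate(lines):
            # A data item header looks like "> <TAG>"; its value is on the next line.
            if line.startswith(">") and f"<{tag}>" in line and i + 1 < len(lines):
                value = lines[i + 1].strip()
        pairs.append((name, value))
    return pairs


def toy_record(name, activity):
    """Build a one-atom SD record, just enough to exercise the reader."""
    return "\n".join([
        name, "  sketch", "",
        "  1  0  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        "M  END",
        "> <Activity>", activity, "",
        "$$$$", "",
    ])


sample = toy_record("mol_1", "Active") + toy_record("mol_2", "Inactive")
pairs = read_sd_activities(sample)
labels = {value for _, value in pairs}
print(pairs)
print("binary labels (SAR):", len(labels) == 2)
```

For a SAR model the set of distinct values should contain exactly two labels, whereas for a QSAR model each value would instead be converted to a number in a single, consistent unit.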

During the training procedure, Morgan fingerprints are calculated for the whole set, and a SAR or QSAR model is built. For categorical modeling, the user is presented with area under the curve (AUC)[60] measurements for the self prediction and for the y-randomized set, and with an average AUC for 5-fold cross validation. Self prediction means that the prediction is run on the same file used for training. Precision[61] (the fraction of real active compounds among the predicted active compounds) and recall (the fraction of real active compounds that are predicted active), both evaluated with the recommended threshold applied to the predicted score, are also presented. The Pearson correlation coefficient R² is used for the same purpose for regression models.[62] A user should keep an eye on the output panel to verify the results that we explain and discuss below.

A well-known problem of some computational models, and of ML approaches in particular, is overfitting the data[63]; that is, instead of making predictions, they merely “recall” the compounds very similar to those seen in the training set. To detect potential overfitting we use y-randomization, a method used in the validation of ML models.[64–69] The performance of the original model in data description (AUC or R²) is compared to that of models built using a randomly shuffled response, based on the original descriptor pool and model building procedure. When y-randomization is used, AUC and R² values should be close to 50% and 0, respectively. If these values are close to the self-AUC or cross-validation values, the model overfits and is unsuitable for use. Note that in our case y-randomization is performed a single time only, which in practical circumstances is usually sufficient to flag an obviously overfitting model but does not constitute a robust hypothesis test as described by Shen et al.[70]

Cross validation is a common method used to validate a (Q)SAR model. In cross validation, some compounds are held out as a test set, while the remaining compounds form a training set. In our case, 5-fold cross validation is performed by randomly splitting the initial training set into five equal parts. In turn, four parts (80%) are taken as the training set and one part (20%) is taken as the test set. This process is repeated five times so that each 20% subset acts as a test set once. The average AUC is reported. Cross validation is the best indicator of the model’s predictive ability in the absence of external test sets of experimental results. High cross validation and low y-randomization values are indicators of a good model, although by no means a guarantee of one.

A prediction score as well as “active/inactive” labels and the recommended threshold are displayed as the output of the prediction (Fig. 1). For SAR prediction most of the methods return a numerical “score”; to get a binary prediction, a threshold must be chosen to separate actives from inactives. The prediction threshold is automatically recommended based on the balance between recall and precision values; that is, it is chosen so that precision and recall are as close in value as possible. The original score is also available for the model runs, so that a user can select his/her own threshold if desired.

The SAR categorization training model attributes shown in Figure 1 indicate a higher value of cross validation and a lower value of y-randomization. The y-randomization result is lower than the self-AUC or cross validation result. These values mean that the model can predict its training values, does not overfit the data, and is suitable for use. The recommended threshold for this training model is 0.0435, which means that compounds with a score above this value might be considered active.

Figure 1. SAR categorization training model results for the HCV RNA polymerase (gene product NS5B).[23]

Prediction procedure

To run the prediction procedure, the user needs to submit a new SD file through the submission panel. The user can validate the training model on an external validation set. If the SD file submitted for prediction contains existing values for the activity property, CHARMMing will calculate and display precision and recall measures or R². Models are saved and made available to the user to run on additional external data sets (Fig. 2).

Figure 2. SAR categorization prediction model results for the HCV RNA polymerase (gene product NS5B).[23]

The job/model tracking interface allows the user to monitor submitted jobs and created models.

The new (Q)SAR tool can be used as a stand-alone utility or combined with molecular docking techniques as a postprocessing procedure.[71] It is always imperative that the (Q)SAR computational prediction is supported by thorough experimental testing.

Conclusions and Outlook

A new (Q)SAR module has been developed within CHARMMing, a graphical WI to CHARMM. Further, multiple advanced ML methods have been implemented with detailed instructions for creating new models (www.charmmtutorial.org). This module can be used as a stand-alone utility or as a supporting filter for other modeling procedures. It is expected that this new web service will assist both novice and expert scientists in the field.

Free web services based on open source technology, such as the one described herein, have the potential to dramatically change research strategies of academic research groups and nonprofit organizations. For example, it can be used within the context of Teach-Discover-Treat initiatives, which specifically emphasize the use of freely available software tools and the reproducibility of the results.[72] One of the 2014 Teach-Discover-Treat (TDT) challenges deals with malaria, a serious disease that according to the World Health Organization (WHO) kills up to 1.2 million people a year. Another potential use case is the Tox21 data challenge,[73] which aims to bring data analysis to toxicity prediction, a problem that is important not only to drug design but to the wider study of life chemistry in general. We hope that our freely available (Q)SAR tool can aid in this and similar work.

Acknowledgments

The authors acknowledge Dr. Paul A. Thiessen, Dr. Evan Bolton, and his Team for valuable discussions and for providing core expertise in the area of PubChem. This research was supported in part by the Intramural Research program of the National Heart, Lung, and Blood Institute, NIH. IEW, BTM, and BRB utilized the high-performance computational capabilities of the LoBoS (http://www.lobos.nih.gov/) and Biowulf Linux clusters (http://biowulf.nih.gov) at the National Institutes of Health.

Keywords: CHARMMing, SAR, QSAR, machine learning, random forest
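The validation measures described under the training procedure (AUC, 5-fold splitting, y-randomization, and a threshold that balances precision and recall) can be illustrated with a short standard-library sketch. It is not the code behind the CHARMMing module: all function names and the toy scores are hypothetical, and model fitting itself is left out, so the y-randomization helper only produces the shuffled response on which a model would be retrained.

```python
import random


def auc(scores, labels):
    """ROC AUC via the rank-sum identity: P(active score > inactive score), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def five_fold_indices(n, seed=0, k=5):
    """Shuffle indices 0..n-1 and deal them into k nearly equal test folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def y_randomize(labels, seed=0):
    """Return a shuffled copy of the response; a model is then retrained on it."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def recommend_threshold(scores, labels):
    """Pick the score cut-off at which precision and recall are closest."""
    best, best_gap = None, float("inf")
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
        if tp == 0:
            continue  # precision is undefined or zero; not a useful cut-off
        gap = abs(tp / (tp + fp) - tp / (tp + fn))
        if gap < best_gap:
            best, best_gap = t, gap
    return best


# Toy self-prediction scores (1 = active, 0 = inactive); values are made up.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

folds = five_fold_indices(len(scores))
for test_fold in folds:
    train = [i for i in range(len(scores)) if i not in test_fold]
    # A model would be fitted on `train` and scored on `test_fold` here.

print("self AUC:", auc(scores, labels))
print("recommended threshold:", recommend_threshold(scores, labels))
print("shuffled response:", y_randomize(labels))
```

For these toy values the AUC is 14/15 and the recommended cut-off is 0.7, the point at which precision and recall both equal 2/3. In a full workflow the AUC would be recomputed after retraining on the shuffled response, and a value near 0.5 would be the expected outcome for a model that is not merely memorizing its training set.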


How to cite this article: I. E. Weidlich, Y. Pevzner, B. T. Miller, I. V. Filippov, H. L. Woodcock, B. R. Brooks, J. Comput. Chem. 2015, 36, 62–67. DOI: 10.1002/jcc.23765

[1] B. T. Miller, R. P. Singh, J. B. Klauda, M. Hodošček, B. R. Brooks, H. L. Woodcock, J. Chem. Inf. Model. 2008, 48, 1920.
[2] B. R. Brooks, C. L. Brooks, A. D. Mackerell, J. Comput. Chem. 2009, 30, 1545.
[3] http://webglmol.sourceforge.jp/index-en.html, Accessed on August 11, 2014.
[4] R. M. Hanson, J. Prilusky, Z. Renjian, T. Nakane, J. L. Sussman, Isr. J. Chem. 2013, 53, 207.
[5] Wikipedia, Structure–activity relationship, http://en.wikipedia.org/wiki/Structure-activity_relationship, Accessed on October 23, 2014.
[6] Wikipedia, Quantitative structure–activity relationship, http://en.wikipedia.org/wiki/Quantitative_structure-activity_relationship, Accessed on October 23, 2014.
[7] G. Kalyani, D. Sharma, Y. Vaishnav, V. S. Deshmukh, IJPRD 2013, 5, 015.
[8] P. Csermely, T. Korcsmaros, H. J. M. Kiss, G. London, R. Nussinov, Pharmacol. Ther. 2013, 138, 333.
[9] K. H. Bleicher, H. J. Böhm, K. Müller, A. I. Alanine, Nat. Rev. Drug Discov. 2003, 2, 369.
[10] P. Bamborough, D. Drewry, G. Harper, G. K. Smith, K. Schneider, J. Med. Chem. 2008, 51, 24, 7898.
[11] S. V. Frye, Chem. Biol. 1999, 6, R3.
[12] F. Lovering, J. Bikker, C. Humblet, J. Med. Chem. 2009, 52, 6752.
[13] M. Manubusan, J. Paterson, R. Kent, J. Chen, North American Free Trade Agreement (NAFTA). Technical Working Group on Pesticides (TWG). (Quantitative) Structure Activity Relationship [(Q)SAR]. Guidance Document, http://www.epa.gov/oppfead1/international/naftatwg/guidance/qsar-guidance.pdf, Accessed on October 23, 2014.
[14] J. D. McKinney, A. Richard, C. Waller, M. C. Newman, F. Gerberick, Toxicol. Sci. 2000, 56, 8.
[15] M. B. Kapis, S. C. Gad, Non-Animal Techniques in Biomedical and Behavioral Research and Testing; Lewis Publishers (CRC Press), Florida, 1993, Chapter 3, p. 36.
[16] K. Gade, ScienceNordic 2014, 6, 18.
[17] T. Puzyn, P. T. Leszczynski, J. Cronin, In Theoretical and Computational Chemistry. Recent Advances in QSAR studies. Methods and Applications; Springer: Dordrecht, 2010; p. 8.
[18] S. R. Johnson, JCIM 2008, 48, 25.
[19] S. Garg, V. Sharma, VSRD-IJBPS 2012, 1, 1.
[20] Wikipedia, Random forest, http://en.wikipedia.org/wiki/Random_forest, Accessed on October 23, 2014.
[21] L. Breiman, Random Forests, http://oz.berkeley.edu/~breiman/randomforest2001.pdf, Accessed on October 23, 2014.
[22] C. Yang, Y. Li, C. Zhang, Y. Hu, A Fast KNN Algorithm Based on Simulated Annealing, http://wwwmath.uni-muenster.de/u/lammers/EDU/ws07/Softcomputing/Literatur/1-DMI5460.pdf, Accessed on October 23, 2014.
[23] I. E. Weidlich, I. V. Filippov, J. Brown, N. Kaushik-Basu, R. Krishnan, M. C. Nicklaus, I. F. Thorpe, Bioorg. Med. Chem. 2013, 21, 3127.
[24] Wikipedia, Cross-validation (statistics), http://en.wikipedia.org/wiki/Crossvalidation_%28statistics%29, Accessed on October 23, 2014.
[25] http://www.mathe2.uni-bayreuth.de/markus/pdf/pub/YRandQsar.pdf, Accessed on August 11, 2014.
[26] CHEMBENCH, https://chembench.mml.unc.edu/, Accessed on October 23, 2014.
[27] Online chemical database with modeling environment, https://ochem.eu/home/show.do, Accessed on October 23, 2014.
[28] Wikipedia, Support vector machine, http://en.wikipedia.org/wiki/Support_vector_machine, Accessed on October 23, 2014.
[29] L. Li, T. A. Darden, C. R. Weinberg, A. J. Levine, L. G. Pedersen, Comb. Chem. High Throughput Screen. 2001, 4, 727.
[30] Scikit learn, supervised learning, http://scikit-learn.org/stable/user_guide.html, Accessed on October 23, 2014.
[31] Open for Innovation, KNIME, http://www.knime.org/, Accessed on October 23, 2014.
[32] Orange, Data-Mining Fruitful and Fun, http://orange.biolab.si/, Accessed on October 23, 2014.
[33] WEKA, The University of Waikato, Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/index.html, Accessed on October 23, 2014.
[34] Biovia Foundation, Biovia Pipeline Pilot Overview, http://accelrys.com/products/pipeline-pilot/, Accessed on October 23, 2014.
[35] RDKit: Cheminformatics and Machine Learning Software, http://rdkit.org/, Accessed on October 23, 2014.
[36] Sci-kit learn. Machine learning in Python, http://scikit-learn.org/stable/, Accessed on October 23, 2014.
[37] Rdkit. A toolkit for cheminformatics and machine learning, http://code.google.com/p/rdkit/, Accessed on October 23, 2014.
[38] D. Rogers, M. Hahn, JCIM 2010, 50, 742.
[39] D. Rogers, R. D. Brown, M. J. Hahn, Biomol. Screen. 2005, 10, 682.
[40] Accelrys, Chemistry Component Collection, http://accelrys.com/products/datasheets/chemistry-collection.pdf, Accessed on October 23, 2014.
[41] A. Lagunin, A. Stepanchikova, D. Filimonov, V. Poroikov, Bioinformatics Applications Note, Pass: prediction of activity spectra for biologically active substances, http://bioinformatics.oxfordjournals.org/content/16/8/747.full.pdf, Accessed on October 23, 2014.
[42] The Lobos Cluster, Laboratory of Computational Biology, http://www.lobos.nih.gov/, Accessed on October 23, 2014.
[43] Adaptive Computing, TORQUE RESOURCE MANAGER, http://www.adaptivecomputing.com/products/open-source/torque/, Accessed on October 23, 2014.
[44] CHARMM Tutorial, SAR and QSAR Introduction, http://charmmtutorial.org/index.php/SAR_and_QSAR_Introduction, Accessed on October 23, 2014.
[45] CHARMM Tutorial, SAR Categorization Lesson, http://charmmtutorial.org/index.php/SAR_Categorization_Lesson, Accessed on October 23, 2014.
[46] CHARMM Tutorial, QSAR Regression Lesson, http://charmmtutorial.org/index.php/QSAR_Regression_Lesson, Accessed on October 23, 2014.
[47] PUG REST, https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html, Accessed on October 23, 2014.
[48] PubChem Bioassay, https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi, Accessed on October 23, 2014.
[49] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang, S. H. Bryant, Nucleic Acids Res. 2009, 37, W623.
[50] Y. Wang, E. Bolton, S. Dracheva, K. Karapetyan, B. A. Shoemaker, T. O. Suzek, J. Wang, J. Xiao, J. Zhang, S. H. Bryant, Nucleic Acids Res. 2010, 38, D255.
[51] Q. Li, Y. Wang, S. H. Bryant, Bioinformatics 2009, 25, 3310.
[52] X. Q. Xie, J. Z. Chen, J. Chem. Inf. Model. 2008, 48, 465.
[53] J. Jaworska, Review of Methods for Assessing the Applicability Domains of SARS and QSARS, https://eurl-ecvam.jrc.ec.europa.eu/laboratoriesresearch/predictive_toxicology/information-sources/qsar-document-area/applicability_domain_overview.pdf, Accessed on October 23, 2014.
[54] Diversity Genie, http://www.diversitygenie.com/, Accessed on October 23, 2014.
[55] T. I. Oprea, J. Gottfries, J. Comb. Chem. 2001, 3, 157.
[56] J. Larsson, J. Gottfries, S. Muresan, A. Backlund, J. Nat. Prod. 2007, 70, 789.
[57] N. E. Shemetulskis, D. Weininger, C. J. Blankley, J. J. Yang, C. Humblet, J. Chem. Inf. Comput. Sci. 1996, 36, 862.
[58] Library Design: Molecular Diversity, Diverse Solutions, Tripos, http://tripos.com/data/SYBYL/DiverseSolutions_072505.pdf, Accessed on October 23, 2014.
[59] T. I. Oprea, In Chemoinformatics in Drug Discovery, Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, 2005; Chapter 23, p. 330.
[60] Wikipedia, Receiver operating characteristic, http://en.wikipedia.org/wiki/Receiver_operating_characteristic, Accessed on October 23, 2014.
[61] Wikipedia, Precision and recall, http://en.wikipedia.org/wiki/Precision_and_recall, Accessed on October 23, 2014.
[62] Wikipedia, Pearson product-moment correlation coefficient, http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient, Accessed on October 23, 2014.
[63] A. Tropsha, Mol. Inf. 2010, 29, 476.
[64] A. Kovatcheva, A. Golbraikh, S. Oloff, Y. Xiao, W. Zheng, P. Wolschann, G. Buchbauer, A. Tropsha, J. Chem. Inf. Comput. Sci. 2004, 44, 582.
[65] A. Asikainen, J. Ruuskanen, K. A. Tuppurainen, Environ. Sci. Technol. 2004, 38, 6724.
[66] H. Kubinyi, F. A. Hamprecht, T. Mietzner, J. Med. Chem. 1998, 41, 2553.
[67] R. G. Karki, V. M. Kulkarni, Bioorg. Med. Chem. 2001, 9, 3153.
[68] A. Tropsha, P. Gramatica, V. K. Gombar, QSAR Comb. Sci. 2003, 22, 69.
[69] G. Klopman, A. N. Kalos, J. Comput. Chem. 1985, 6, 492.
[70] M. Shen, A. LeTiran, Y. Xiao, A. Golbraikh, H. Kohn, A. Tropsha, J. Med. Chem. 2002, 45, 2811.
[71] A. E. Klon, M. Glick, M. Thoma, P. Acklin, J. W. Davies, J. Med. Chem. 2004, 47, 2743.
[72] TDT Teach-Discover-Treat, http://www.teach-discover-treat.org/, Accessed on October 23, 2014.
[73] Tox21 Data Challenge 2014, https://tripod.nih.gov/tox21/challenge/, Accessed on October 23, 2014.

Received: 12 August 2014 Revised: 3 October 2014 Accepted: 10 October 2014 Published online on 3 November 2014

