Int. J. Computational Biology and Drug Design, Vol. 8, No. 1, 2015
In silico prediction of anti-malarial hit molecules based on machine learning methods

Madhulata Kumari*
Department of Information Technology, Kumaun University, SSJ Campus, Almora, Uttarakhand 263601, India
Email: [email protected]
*Corresponding author

Subhash Chandra
Department of Botany, Kumaun University, SSJ Campus, Almora, Uttarakhand 263601, India
Email: [email protected]

Abstract: Machine learning techniques have been widely used in drug discovery and development, particularly in cheminformatics. Aspartyl aminopeptidase (M18AAP) of Plasmodium falciparum is crucial for the survival of the malaria parasite. We created predictive models using Weka and evaluated their performance based on various statistical parameters. The Random Forest based model showed the highest specificity (97.94%), together with the best accuracy (97.3%), MCC (0.306) and ROC (86.1%). The accuracy and MCC of these models indicated that they could be used to classify large datasets of unknown compounds and predict their anti-malarial activity, supporting the development of effective drugs. Further, we deployed the best predictive model on NCI diversity set IV. As a result, we found 59 compounds predicted to be bioactive anti-malarial molecules inhibiting M18AAP, of which 18 were predicted to be non-toxic hit molecules. We suggest that such machine learning approaches could be applied to reduce the cost and duration of drug discovery.

Keywords: machine learning; data mining; Weka; Random Forest; Naïve Bayes; J48; toxicity prediction; malaria; drug discovery.

Reference to this paper should be made as follows: Kumari, M. and Chandra, S. (2015) ‘In silico prediction of anti-malarial hit molecules based on machine learning methods’, Int. J. Computational Biology and Drug Design, Vol. 8, No. 1, pp.40–53.

Biographical notes: Madhulata Kumari is a PhD research scholar in the Department of Information Technology, Kumaun University, SSJ Campus, Almora, Uttarakhand, India. She received her Master's degree in Information Technology from Punjab Technical University, India. Her research interests include C++ programming, statistical analysis, data mining, machine learning, molecular docking, pharmacophore modelling, 3D-QSAR modelling, lead optimisation, in silico ADMET prediction and drug design.
Copyright © 2015 Inderscience Enterprises Ltd.
Subhash Chandra is an Assistant Professor in the Department of Botany, Kumaun University, SSJ Campus, Almora, Uttarakhand, India. He received his PhD degree in Biotechnology from Jawaharlal Nehru University, India in 2006 and his MSc degree in Biotechnology from Dr. B.R. Ambedkar University, Agra in 1999. His research interests include vaccine development, data mining, drug discovery, artificial intelligence and computational biology. His work has been published in various peer-reviewed journals.

This paper is a revised and expanded version of a paper entitled ‘Predictive modelling of anti-malarial compounds by machine learning techniques’ presented at the National Conference on Recent Advances in Statistical and Mathematical Sciences and Their Applications, Kumaun University, Nainital, India, 04–06 October, 2014.
1 Introduction
According to the World Malaria Report 2013, there are 97 countries and territories with ongoing malaria transmission and seven countries in the prevention of re-introduction phase, making a total of 104 countries and territories in which malaria is presently considered endemic (World Health Organization, 2013a). Globally, an estimated 3.4 billion people are at risk of malaria. WHO estimates that 207 million cases of malaria and 627,000 deaths (uncertainty range 473,000–789,000) occurred globally in 2012 (World Health Organization, 2013b). Most cases (80%) and deaths (90%) occurred in Africa, and most of the deaths (77%) were in children under 5 years of age (http://apps.who.int/bookorders/anglais/detart1.jsp?codlan=1&codcol=15&codcch=6740). In recent years, P. falciparum resistance to artemisinins has been detected in four countries of the Greater Mekong subregion: Cambodia, Myanmar, Thailand and Vietnam (http://www.malariaconsortium.org/media-downloads/207/Technical%20brief:%20antimalarial%20drug%20resistance). Given these facts, there is an urgent need to increase funding for malaria control and to expand programme coverage in order to meet international targets for reducing malaria cases and deaths.

Malaria is caused by protozoan parasites belonging to the genus Plasmodium. P. falciparum, P. vivax, P. ovale and P. malariae are the four species that are the main causative agents in humans; of these, P. falciparum is the most commonly encountered and the deadliest (Newton et al., 1998). Treatments for malaria already exist; however, resistance to anti-malarial drugs is a serious problem. The resistance of P. falciparum to previous generations of medicines, such as chloroquine and sulfadoxine-pyrimethamine, became widespread in the 1970s and 1980s, undermining malaria control efforts and reversing gains in child survival.
In recent years, parasite resistance to artemisinins has been detected in many countries (http://www.who.int/mediacentre/factsheets/fs094/en/). WHO recommends the routine monitoring of anti-malarial drug resistance and supports countries to strengthen their efforts in this important area of work. More comprehensive recommendations are available in the WHO Global Plan for Artemisinin Resistance Containment, which was released in 2011 (http://www.who.int/malaria/publications/atoz/artemisinin_resistance_containment_2011.pdf).
M18AAP is the sole Aspartyl aminopeptidase (AAP) present in the genome of the malaria parasite (Wilk et al., 1998). Studies have shown that genetic knockdown of P. falciparum M18AAP results in a lethal parasite phenotype (Teuscher et al., 2007) and that inhibitors of methionine (Chen et al., 2006) and leucine aminopeptidases (Nankya-Kitaka et al., 1998; Stack et al., 2007) prevent malaria growth in culture and haemoglobin degradation, suggesting that these enzymes are essential for parasite survival. The identification of selective inhibitors of P. falciparum M18AAP (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1822) indicated that this enzyme plays a role in the P. falciparum lifecycle and could serve as a target for developing therapeutic agents to control malaria infection.

Predictive modelling based on machine learning algorithms is a valuable method for finding new drug candidates faster. It has been used to create models for several diseases such as tuberculosis (Periwal et al., 2011, 2012), malaria (Jamal and Periwal, 2013) and leishmaniasis (Jamal, 2013), and for predicting the phospholipidosis inducing potential (Ivanciuc, 2008). In the present study, we used machine learning techniques to create predictive models from high-throughput screening data of anti-malarial compounds that inhibit the activity of M18AAP in the malaria parasite, P. falciparum. Further, we predicted the activity of an unlabelled dataset, the NCI diversity set IV, containing 1596 compounds. Our results suggest that the predictive model can be used to prioritise drug-like hit molecules against malaria.
2 Materials
2.1 Performance of high-throughput assay

The dataset of anti-malarial compounds AID 1822 based on cell-based assay was obtained from the PubChem database maintained by the National Centre for Biotechnology Information (NCBI) (Wang et al., 2009; http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1822). The assay was a primary biochemical high-throughput screen of the compounds, assayed by a fluorescence-based method to measure the growth and replication of P. falciparum. The purpose of the assay was to identify inhibitors of M18 Aspartyl aminopeptidase (PFM18AAP), the sole Aspartyl aminopeptidase (AAP) present in P. falciparum, reported to be essential for the survival of the parasite.
2.2 Classification of compounds

Compounds were categorised using a hit cut-off computed from two values: the average percent inhibition of all tested compounds and three times their standard deviation. The sum of these two values was used as the cut-off parameter, and compounds whose percent inhibition exceeded the cut-off were considered active. Compounds having negative percent inhibition were assigned an activity score of 0. The dataset, AID 1822, contained a total of 290,731 tested compounds. Compounds having a PubChem activity score from 28 to 100 were considered as active (N = 3498), and all compounds having a score from 0 to 28 were considered as inactive (N = 287,235). The compounds from the active and inactive datasets were downloaded in structural data format (SDF).
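The cut-off rule and the score-based labelling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sample standard deviation is assumed, and a score of exactly 28 is treated as active because the text lists 28 in both ranges.

```python
from statistics import mean, stdev


def hit_cutoff(percent_inhibition):
    """Cut-off = mean percent inhibition + 3 x standard deviation."""
    # Assumption: the sample standard deviation is used.
    return mean(percent_inhibition) + 3 * stdev(percent_inhibition)


def label_by_activity_score(score, threshold=28):
    """Label a compound from its PubChem activity score (0 to 100)."""
    # Assumption: a score equal to the threshold counts as active.
    return "active" if score >= threshold else "inactive"


if __name__ == "__main__":
    toy_inhibition = [0.0, 0.0, 0.0, 0.0, 10.0]  # made-up values
    print(hit_cutoff(toy_inhibition))
    print([label_by_activity_score(s) for s in (0, 27, 28, 100)])
```

The toy values are invented for the example and are not taken from AID 1822.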
3 Computational methodology
3.1 Pre-processing of data and descriptor generation

Since the number of compounds was too large to generate descriptors in one run without causing memory exceptions in PowerMV, the large SDF datasets were split into smaller SDF files using the SplitSDFiles Perl script available from MayaChemTools (Sud, 2012). The freely available descriptor generation and molecular visualisation software, PowerMV, was then used to generate 2D molecular descriptors for the active and inactive datasets (Liu et al., 2005). A total of 179 2D descriptors were calculated, of which 147 were pharmacophore fingerprints, 24 were weighted Burden numbers and 8 were property descriptors.
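The splitting step can be illustrated with a short Python sketch. It is not the SplitSDFiles script used in the study; it only relies on the fact that each record in an SD file ends with a line containing `$$$$`. The function names and the chunk size are assumptions made for the example.

```python
from typing import Iterable, Iterator, List


def iter_sdf_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one SDF record at a time; each record ends with a '$$$$' line."""
    record: List[str] = []
    for line in lines:
        record.append(line)
        if line.strip() == "$$$$":
            yield "".join(record)
            record = []
    # Any trailing lines without a closing '$$$$' are ignored as incomplete.


def chunk_records(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into lists of at most chunk_size records."""
    chunk: List[str] = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


if __name__ == "__main__":
    toy = ["mol1\n", "$$$$\n", "mol2\n", "$$$$\n", "mol3\n", "$$$$\n"]
    for i, chunk in enumerate(chunk_records(iter_sdf_records(toy), 2), start=1):
        print(f"part {i}: {len(chunk)} record(s)")
```

In practice each chunk would be written to its own file and passed to the descriptor generator separately, which keeps the memory needed per run bounded.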
3.2 Weka machine learning toolkit

In the present study, we used the Weka data mining toolkit (version 3.6). Weka, which stands for Waikato Environment for Knowledge Analysis, was developed by the University of Waikato, New Zealand. It is freely available Java-based software that implements several machine learning methods and algorithms for extracting rules and functions from large datasets (Bouckaert et al., 2010).
3.3 Train and test set

Using a custom script, the dataset was split randomly into an 80% train-cum-validation set and a 20% independent test set. The training set and test set were then converted to ARFF (attribute relation file format) for supervised machine learning analysis and model generation in Weka. Tenfold cross-validation was applied to the train-cum-validation set.
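The custom splitting script is not given in the paper; the following Python sketch shows one plausible way to perform a random 80/20 split followed by tenfold cross-validation on the training portion. The function names, the fixed seed and the interleaved fold assignment are assumptions for illustration only.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle items and return (train, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(items, k=10):
    """Yield (training, validation) pairs for k-fold cross-validation."""
    folds = [items[i::k] for i in range(k)]  # interleaved assignment to folds
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation


if __name__ == "__main__":
    train, test = train_test_split(range(100))
    print(len(train), len(test))
    for training, validation in k_folds(train, k=10):
        print(len(training), len(validation))
```

With a minority class as small as the one in this dataset, a stratified split that preserves the active-to-inactive ratio in every fold may be preferable; the sketch above follows the purely random split described in the text.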
3.4 Removal of attributes

Descriptors that have the same value throughout the dataset do not contribute to the classification of compounds, and the high dimensionality of the data would otherwise increase computational time and memory use. Attributes having only one value (all 0's or all 1's) throughout were therefore filtered out using the RemoveUseless module of Weka.
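The effect of this filtering step on a descriptor table can be sketched as follows. The sketch covers only the constant-column case described above and is not Weka's RemoveUseless implementation; the function name and the toy data are hypothetical.

```python
def remove_constant_columns(rows, names):
    """Drop every column that takes a single value across all rows."""
    keep = [j for j in range(len(names)) if len({row[j] for row in rows}) > 1]
    filtered_rows = [[row[j] for j in keep] for row in rows]
    filtered_names = [names[j] for j in keep]
    return filtered_rows, filtered_names


if __name__ == "__main__":
    names = ["bit_1", "bit_2", "weight"]  # made-up descriptor names
    rows = [[0, 1, 310.2], [0, 0, 289.7], [0, 1, 402.1]]
    print(remove_constant_columns(rows, names))
```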
3.5 Predictive modelling by machine learning

In our study, we used three different classifiers, namely, Naïve Bayes, Random Forest and J48. The Naïve Bayes classifier, which is based on Bayes' theorem, assumes that each predictor is conditionally independent of the others (Friedman et al., 1997). It is one of the simplest and most effective classifiers. The Random Forest (RF) algorithm, based on multiple decision trees, was developed by Breiman (2001). The J48 classifier is a simple C4.5 decision tree for classification. It creates a binary tree. The decision tree approach is most useful in classification problems. It builds decision trees from a set of labelled training data using the fact that each attribute of the data can be used to make a decision by splitting the data into smaller subsets (Quinlan, 1993). Cost sensitivity was introduced by means of a meta-learner. The meta-learner employed in this study was the cost sensitive classifier (CSC), applied to Naïve Bayes, Random Forest and J48, respectively (Domingos, 1999).
3.6 Cost sensitive classifiers

An important issue to consider while using standard classifiers for model building is the imbalanced nature of the dataset, i.e., the class imbalance problem. This problem arises when the number of inactive molecules far exceeds the number of actives, the minority ratio being 1.5% in our study. Standard classifiers that weight all classes equally are unable to handle such highly imbalanced data and tend to assume that all misclassification errors cost the same. One alternative is to use CSCs, in which misclassification costs are specified explicitly (Elkan, 2001). In our predictive modelling we used MetaCost to implement cost sensitive learning on standard classifiers to reduce misclassification.
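To illustrate how misclassification costs change a prediction, the sketch below applies the minimum-expected-cost decision rule discussed by Elkan (2001) to a single compound. It is a generic illustration, not the procedure run inside Weka for this study; the function names and the cost values are assumptions.

```python
def min_cost_label(p_active, cost_fn, cost_fp=1.0):
    """Choose the label with the lower expected misclassification cost.

    p_active: estimated probability that the compound is active.
    cost_fn:  cost of calling an active compound inactive.
    cost_fp:  cost of calling an inactive compound active.
    """
    expected_cost_if_inactive = p_active * cost_fn
    expected_cost_if_active = (1.0 - p_active) * cost_fp
    if expected_cost_if_active < expected_cost_if_inactive:
        return "active"
    return "inactive"


def decision_threshold(cost_fn, cost_fp=1.0):
    """Probability above which 'active' becomes the cheaper prediction."""
    return cost_fp / (cost_fp + cost_fn)


if __name__ == "__main__":
    for cost in (1, 5, 10):  # illustrative false-negative costs
        print(cost, decision_threshold(cost), min_cost_label(0.2, cost))
```

Raising the false-negative cost lowers the probability threshold for calling a compound active, which is why both true positives and false positives increase as the cost grows.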
3.7 Confusion matrix

A confusion matrix, also known as a contingency table or an error matrix (Stehman, 1997), is a specific table layout that allows visualisation of the performance of an algorithm, typically a supervised learning one. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. It makes it easy to see whether the system is confusing two classes. Weka introduces cost sensitivity in the base classifiers by means of a cost matrix laid out like the confusion matrix, which for a binary classification scheme consists of four sections: true positives (TP) for actives correctly classified as actives; false positives (FP) for inactives incorrectly classified as actives; true negatives (TN) for inactives correctly classified as inactives; and false negatives (FN) for actives incorrectly classified as inactives (Table 1). Because false negatives are more costly than false positives when selecting compounds, we set a misclassification cost on FN to reduce their number, at the expense of more false positives. However, increasing the cost for false negatives increases both the false positives and the true positives; we therefore set an empirical upper limit of 20% on the false positive rate (FPR). The choice of misclassification cost is always arbitrary and no general rule exists for it. It depends more or less on the base classifier used.

Table 1 Confusion matrix

                 Predicted class
Actual class     Active                 Inactive
Active           True positive (TP)     False negative (FN)
Inactive         False positive (FP)    True negative (TN)
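The four cells of Table 1 can be counted directly from paired lists of actual and predicted labels. The following Python sketch is a minimal illustration; the function name and the label strings are assumptions.

```python
def confusion_counts(actual, predicted, positive="active"):
    """Count TP, FN, FP and TN for a binary problem."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1  # active correctly predicted active
        elif a == positive:
            counts["FN"] += 1  # active predicted inactive
        elif p == positive:
            counts["FP"] += 1  # inactive predicted active
        else:
            counts["TN"] += 1  # inactive correctly predicted inactive
    return counts


if __name__ == "__main__":
    actual = ["active", "active", "inactive", "inactive", "inactive"]
    predicted = ["active", "inactive", "active", "inactive", "inactive"]
    print(confusion_counts(actual, predicted))
```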
3.8 Statistical measures for evaluation of predictive models

Quality and performance of the predictive models were evaluated using various standard statistical measures such as sensitivity, specificity, accuracy, balanced classification rate (BCR) and MCC, based on the confusion matrix entries TP, FP, TN and FN. True positive rate (TPR) is the ratio of correctly predicted actives to the actual number of actives. FPR is the ratio of inactives incorrectly predicted as active to the actual number of inactives. Accuracy indicates the overall performance of the classifiers. Sensitivity is the proportion of active compounds that the learning method predicted as active. Specificity is the proportion of inactive compounds that were correctly predicted as inactive. BCR is the mean of sensitivity and specificity; it gives a balanced accuracy of the classifiers that is not dominated by the majority class. MCC is a statistical measure used in machine learning for evaluating model quality. It gives a correlation coefficient between the observed and predicted classifications, and its value ranges from −1 to +1, where +1 represents a perfect prediction (http://www.damienfrancois.be/blog/files/modelperfcheatsheet.pdf; Sokolova and Lapalme, 2009; Demsar, 2006). ROC is a 2D graphical plot for the representation and visualisation of performance, in which TPR is plotted on the Y axis against FPR on the X axis. It gives the performance of a binary classifier system (Fawcett, 2006). The area under the curve (AUC) of the ROC lies between 0 and 1. An AUC near 1 means that a randomly chosen active compound is almost always scored higher than a randomly chosen inactive compound.
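The measures defined above follow directly from the four confusion matrix counts. The Python sketch below computes them, together with a pairwise (rank-based) estimate of AUC; the function names and the example counts are assumptions, and the code is not the evaluation routine used by Weka.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification measures from raw counts."""
    sensitivity = tp / (tp + fn)  # identical to TPR
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    bcr = (sensitivity + specificity) / 2
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fpr,
        "accuracy": accuracy,
        "bcr": bcr,
        "mcc": mcc,
    }


def rank_auc(active_scores, inactive_scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly."""
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(active_scores) * len(inactive_scores))


if __name__ == "__main__":
    print(classification_metrics(tp=40, fp=20, tn=880, fn=60))  # made-up counts
    print(rank_auc([0.9, 0.7, 0.4], [0.5, 0.2, 0.1]))
```

The pairwise AUC is quadratic in the number of compounds, so it is suitable only for small illustrations; the example shows why accuracy alone can look high on imbalanced data while sensitivity and MCC remain modest.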
3.9 Active compounds prediction

Deployment of the predictive model is a critical step in mining large datasets. To predict active compounds, we deployed the best predictive Random Forest model on the NCI diversity set IV containing 1596 compounds (http://dtp.nci.nih.gov/branches/dscb/div2_explanation.html), using PDI-CE version 5.1.0.0-752 (http://www.pentaho.com/product/data-integration) and Weka scoring plugin version 3.6,4X (http://wiki.pentaho.com/display/EAI/List+of+Available+Pentaho+Data+Integration+Plug-Ins).
3.10 Toxicity prediction

For toxicity prediction, the carcinogenicity dataset AID 1194 was obtained from the PubChem database maintained by NCBI. The Carcinogenic Potency Database (CPDB) is a unique and widely used international resource containing the results of over 6500 chronic, long-term animal cancer tests on over 1500 chemical substances (http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1194#aDescription). Predictive models were built and the best model was deployed for the prediction of toxicity as described above.
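The two deployment steps, activity prediction followed by toxicity prediction, amount to a two-stage filter over the unlabelled compounds. The Python sketch below shows that logic only; `predict_active` and `predict_toxic` are hypothetical stand-ins for the deployed models, and the study itself ran this step through PDI-CE with the Weka scoring plugin rather than through code like this.

```python
def screen(compounds, predict_active, predict_toxic):
    """Return (predicted actives, predicted non-toxic hits)."""
    # Stage 1: keep compounds the activity model predicts as active.
    actives = [c for c in compounds if predict_active(c)]
    # Stage 2: among those, keep compounds the toxicity model predicts as non-toxic.
    hits = [c for c in actives if not predict_toxic(c)]
    return actives, hits


if __name__ == "__main__":
    # Toy stand-ins for trained models: decide from made-up scores.
    toy = {"cmpd_a": (0.9, 0.1), "cmpd_b": (0.8, 0.7), "cmpd_c": (0.2, 0.1)}
    actives, hits = screen(
        toy,
        predict_active=lambda c: toy[c][0] >= 0.5,
        predict_toxic=lambda c: toy[c][1] >= 0.5,
    )
    print(actives, hits)
```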
4 Results and discussions
4.1 Descriptor generation and model construction

The dataset of active (3498) and inactive (287,235) molecules was downloaded from PubChem. A total of 179 2D molecular descriptors were generated using PowerMV for the entire set of molecules. After removing useless attributes using the RemoveUseless module of Weka, the number of descriptors was reduced to 154, a reduction of approximately 14%. To begin with, the standard classifiers were used to generate the models; cost sensitive classification was then applied to models having a low FP rate, and the misclassification cost was increased until the FP rate reached the upper limit of 20%. As expected, introducing a cost for each classifier resulted in an increase in the number of true positives and a decrease in the number of false negatives, thereby increasing the robustness of the model. The final misclassification cost used for each classifier is presented in Table 2. The Naïve Bayes classifier required the lowest misclassification cost and was quite fast in terms of computation time.

Table 2 Classification results

Classifier  TPR (%)  FPR (%)  ROC (%)  Accuracy (%)  Sensitivity (%)  Specificity (%)  BCR (%)  MCC
NB          63.25    26.9     75.1     72.98         63.25            73.10            68.17    0.089
RF          44.4     2.1      86.1     97.30         44.4             97.94            71.17    0.306
J48         49.8     7.4      72.1     92.07         49.8             92.58            71.19    0.17

NB: Naïve Bayes; RF: Random Forest; TPR: true positive rate; FPR: false positive rate; BCR: balanced classification rate; MCC: Matthews correlation coefficient.
4.2 Model evaluation

A number of models were trained using tenfold cross-validation on the training dataset, with different misclassification cost settings for false negatives, until cost-optimised models were achieved. The best model for each classifier (NB, RF and J48) was chosen based on its performance and evaluated using different statistical measures (Table 2). All statistical results reported in Table 2 are based on the independent test set and not on the training set. The overall efficiency of a classifier was judged from the accuracy of its predictive model. The accuracy ranged from 72.98% to 97.30% (Figure 1). Sensitivity and specificity plots were used to identify the best models and to evaluate the effectiveness of each classifier in correctly identifying positive and negative labelled instances (Figure 2). The specificity ranged from 73.10% to 97.94% and the sensitivity from 44.4% to 63.25%, with NB being the most sensitive classifier for the dataset and RF the least sensitive. Since our dataset was highly imbalanced, we measured both accuracy and MCC to assess classifier performance. In addition, other performance measures, the BCR and ROC curve analysis, were used to assess the robustness of the models. The balanced accuracy values were satisfactory for all the models, with Random Forest (71.17%) and J48 (71.19%) nearly equal and both ahead of Naïve Bayes (68.17%) (Table 2). ROC curve analysis is widely accepted as one of the most reliable approaches for quick performance assessment of virtual screening methods and is therefore widely used to evaluate the discriminatory power of virtual screens. All the models had a significant AUC, obtained from the ROC plots of the three classifiers depicted in Figure 3.

Hence, the analysis showed that Random Forest was on the whole the best classifier, producing an AUC of 86.1% compared with NB (75.1%) and J48 (72.1%).
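The search for a cost-optimised model under the 20% limit on FPR can be sketched as a simple selection over candidate cost settings. The selection criterion used here (highest TPR among settings within the FPR limit) is an assumption, as are the function name and the candidate values; the paper does not state the exact rule that was applied.

```python
def select_cost_setting(candidates, max_fpr=0.20):
    """Pick the candidate with the highest TPR whose FPR is within the limit.

    candidates: list of dicts with keys 'cost', 'tpr' and 'fpr',
    e.g. cross-validation results for each false-negative cost tried.
    Returns None when no candidate satisfies the FPR limit.
    """
    allowed = [c for c in candidates if c["fpr"] <= max_fpr]
    if not allowed:
        return None
    return max(allowed, key=lambda c: c["tpr"])


if __name__ == "__main__":
    # Made-up cross-validation results for three false-negative costs.
    results = [
        {"cost": 1, "tpr": 0.10, "fpr": 0.01},
        {"cost": 10, "tpr": 0.45, "fpr": 0.05},
        {"cost": 100, "tpr": 0.80, "fpr": 0.35},
    ]
    print(select_cost_setting(results))
```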
Figure 1 Comparison of accuracy and balanced classification rate of the models generated in the present study (see online version for colours)

Figure 2 Plot of sensitivity and specificity of models generated based on molecular descriptors (see online version for colours)
4.3 Hit compounds

The predictive model based on AID 1822 predicted 59 active compounds out of the 1596 compounds of NCI diversity set IV. Of these 59 compounds, 18 were predicted to be non-toxic by the toxicity predictive model. The structures of the predicted compounds are shown in Table 3.
Figure 3 ROC plot representing significant AUC curve values for Naïve Bayes, Random Forest and J48 (see online version for colours)

Table 3 Anti-malarial hit compounds (see online version for colours)

S. No.   NCS-ID    PubChem SID
1        2805      26664520
2        11149     –
3        11150     26664346
4        16631     26665736
5        27389     26664288
6        37433     26664445
7        43013     26664432
8        50751     26665171
9        67307     26665150
10       84100     26666171
11       93945     26666811
12       102025    26664639
13       122819    26732588
14       159632    –
15       331208    26667176
16       638080    26725225
17       645987    26665153
18       650438    26664985

Hit compounds column: structure drawings, not reproduced here.

5 Conclusion and future scope
In this study, we used a set of machine learning algorithms to construct predictive models that can be used to screen millions of compounds for activity as anti-malarial drugs. Such models could be used effectively for data mining to discover new drug candidates. In this approach, we used Weka, an open source tool for large-scale application of machine learning algorithms, for the prediction of anti-malarial drugs. This structure-activity study, the largest comparative evaluation of machine learning techniques using the malarial target M18AAP of P. falciparum to predict anti-malarial compounds, represents an important step in identifying the best machine learning algorithms for the high-throughput screening of drug candidates. Our analysis shows that a systematically designed computational model for activity based on chemical descriptors could potentially be used for virtual screening. Comparative analysis of the classifiers revealed that Random Forest performed better than Naïve Bayes and J48. By far the best predictions were obtained with the Random Forest model, which had a prediction accuracy of 97.30%. Weka also represents a very efficient computational environment for testing and comparing machine learning algorithms, with potential applications in drug design and discovery. Hence, predictive models generated by machine learning algorithms could substantially reduce the time and cost of finding a new drug.
Acknowledgements

We would like to thank Dr. Naidu Subbarao and Dr. Andrew M Lynn, School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, 110067, India. This work was done as part of the Crowd Computing for Cheminformatics programme of Open Source Drug Discovery Initiative (www.osdd.net). Authors acknowledge Ms. Salma Jamal and Dr. Vinod Scaria for technical help and guidance in the research. This crowd-sourcing initiative was funded by CSIR, India through project grant HCP001.
References

Bouckaert, R.R., Frank, E., Hall, M.A., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H. (2010) ‘WEKA – experiences with a java open-source project’, J. Mach. Learn. Res., Vol. 11, pp.2533–2541.
Breiman, L. (2001) ‘Random forests’, Mach. Learn., Vol. 45, pp.5–32.
Chen, X., Chong, C.R., Shi, L., Yoshimoto, T., Sullivan Jr., D.J. and Liu, J.O. (2006) ‘Inhibitors of Plasmodium falciparum methionine aminopeptidase 1b possess antimalarial activity’, Proc. Natl. Acad. Sci. U. S. A., Vol. 103, pp.14548–14553.
Demsar, J. (2006) ‘Statistical comparisons of classifiers over multiple data sets’, J. Mach. Learn. Res., Vol. 7, pp.1–30.
Domingos, P. (1999) ‘MetaCost: a general method for making classifiers cost sensitive’, The First Annual International Conference on Knowledge Discovery in Data, pp.155–164.
Elkan, C. (2001) ‘The foundations of cost-sensitive learning’, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Vol. 2, pp.973–978.
Fawcett, T. (2006) ‘An introduction to ROC analysis’, Pattern Recognition Letters, Vol. 27, pp.861–874.
Friedman, N., Geiger, D. and Goldszmidt, M. (1997) ‘Bayesian network classifiers’, Mach. Learn., Vol. 29, pp.131–163.
Ivanciuc, O. (2008) ‘Weka machine learning for predicting the phospholipidosis inducing potential’, Curr. Top. Med. Chem., Vol. 8, pp.1691–1709.
Jamal, S., OSDD Consortium and Scaria, V. (2013) ‘Cheminformatic models based on machine learning for pyruvate kinase inhibitors of Leishmania mexicana’, BMC Bioinformatics, Vol. 14, p.329.
Jamal, S., Periwal, V., OSDD Consortium and Scaria, V. (2013) ‘Predictive cheminformatics analysis of anti-malarial molecules inhibiting apicoplast formation’, BMC Bioinformatics, Vol. 14, p.55.
Liu, K., Feng, J. and Young, S.S. (2005) ‘PowerMV: a software environment for molecular viewing, descriptor generation, data analysis and hit evaluation’, J. Chem. Inf. Model., Vol. 45, pp.515–522.
Nankya-Kitaka, M.F., Curley, G.P., Gavigan, C.S., Bell, A. and Dalton, J.P. (1998) ‘Plasmodium chabaudi chabaudi and P. falciparum: inhibition of aminopeptidase and parasite growth by bestatin and nitrobestatin’, Parasitol. Res., Vol. 84, pp.552–558.
Newton, C.R., Taylor, T.E. and Whitten, R.O. (1998) ‘Pathophysiology of fatal falciparum malaria in African children’, Am. J. Trop. Med. Hyg., Vol. 58, pp.673–683.
Periwal, V., Kishtapuram, S. and Scaria, V. (2012) ‘Computational models for in-vitro antitubercular activity of molecules based on high-throughput chemical biology screening datasets’, BMC Pharmacol., Vol. 12, p.1.
Periwal, V., Rajappan, J.K., Jaleel, A.U. and Scaria, V. (2011) ‘Predictive models for antitubercular molecules using machine learning on high-throughput biological screening datasets’, BMC Res. Notes, Vol. 4, p.504.
Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Francisco.
Sokolova, M. and Lapalme, G. (2009) ‘A systematic analysis of performance measures for classification tasks’, Inf. Process. Manage., Vol. 45, pp.427–437.
Stack, C.M., Lowther, J., Cunningham, E., Donnelly, S., Gardiner, D.L., Trenholme, K.R., Skinner-Adams, T.S., Teuscher, F., Grembecka, J., Mucha, A., Kafarski, P., Lua, L., Bell, A. and Dalton, J.P. (2007) ‘Characterization of the Plasmodium falciparum M17 leucyl aminopeptidase. A protease involved in amino acid regulation with potential for antimalarial drug development’, J. Biol. Chem., Vol. 282, pp.2069–2080.
Stehman, S.V. (1997) ‘Selecting and interpreting measures of thematic classification accuracy’, Remote Sens. Environ., Vol. 62, pp.77–89.
Sud, M. (2012) ‘MayaChemTools: An open source package for computational discovery, COMP poster #306’, 243rd ACS National Meeting & Exposition, 25–29 March, San Diego, CA.
Teuscher, F., Lowther, J., Skinner-Adams, T.S., Spielmann, T., Dixon, M.W., Stack, C.M., Donnelly, S., Mucha, A., Kafarski, P., Vassiliou, S., Gardiner, D.L., Dalton, J.P. and Trenholme, K.R. (2007) ‘The M18 aspartyl aminopeptidase of the human malaria parasite Plasmodium falciparum’, J. Biol. Chem., Vol. 282, pp.30817–30826.
Wang, Y., Xiao, J., Suzek, T.O., Zhang, J., Wang, J. and Bryant, S.H. (2009) ‘PubChem. A public information system for analyzing bioactivities of small molecules’, Nucleic Acids Res., Vol. 37, pp.623–633.
Wilk, S., Wilk, E. and Magnusson, R.P. (1998) ‘Purification, characterization, and cloning of a cytosolic aspartyl aminopeptidase’, J. Biol. Chem., Vol. 273, pp.15961–15970.
World Health Organization (2013a) http://www.who.int/malaria/media/world_malaria_report_2013/en/
World Health Organization (2013b) http://www.who.int/mediacentre/factsheets/fs094/en/
Websites

http://apps.who.int/bookorders/anglais/detart1.jsp?codlan=1&codcol=15&codcch=6740
http://dtp.nci.nih.gov/branches/dscb/div2_explanation.html
http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1194#aDescription
http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1822
http://wiki.pentaho.com/display/EAI/List+of+Available+Pentaho+Data+Integration+Plug-Ins
http://www.damienfrancois.be/blog/files/modelperfcheatsheet.pdf
http://www.malariaconsortium.org/media-downloads/207/Technical%20brief:%20antimalarial%20drug%20resistance
http://www.pentaho.com/product/data-integration
http://www.who.int/malaria/publications/atoz/artemisinin_resistance_containment_2011.pdf
http://www.who.int/mediacentre/factsheets/fs094/en/
Abbreviations

M18AAP            Aspartyl aminopeptidase
ML                Machine learning
RF                Random Forest
SVM               Support vector machines
NB                Naïve Bayes
HTS               High-throughput screen
NCBI              National Centre for Biotechnology Information
TPR               True positive rate
FPR               False positive rate
BCR               Balanced classification rate
MCC               Matthews correlation coefficient
ROC               Receiver operating characteristics
AUC               Area under curve
CSC               Cost sensitive classifier
NCI Div Set IV    NCI diversity set IV