DOI: 10.1002/minf.201100148

Comparing Measures of Promiscuity and Exploring Their Relationship to Toxicity Xiangyun Wang[a] and Nigel Greene*[a]

Abstract: Recent research has focused on algorithms to derive numerical measures of selectivity based on panels of in vitro pharmacology assays so that one molecule’s activity profile may be compared easily with that of another. However, the questions concerning which method or algorithm is best to use, the optimal number of assays required to give an accurate measure of selectivity and the correlation of these measures to in vivo toxicity have remained largely unexplored. In this manuscript we describe a systematic approach to compare and contrast different calculation methods for promiscuity and determine the optimal number and constitution of a panel of assays to measure the selec-

tivity/promiscuity of compounds across all targets. We then go on to examine their relationship to toxicity using a Pfizer proprietary compound set that has both selectivity profiles and exploratory toxicology study results. From this study we conclude that all five methods studied are useful in estimating compound selectivity; that a small panel of between 15 to 30 binding assays can be used as a surrogate for a broader panel enabling higher throughput with lower costs and this panel will most likely have the highest prediction power when correlating this measure to in vivo effects.

Keywords: BioPrint · Compound selectivity · Promiscuity · In vivo toxicity · Gini coefficient · Thermodynamics partition index · Rescaled geometric mean · Rescaled standard deviation · Hit rate

1 Introduction

1. Toxicity linked directly to the primary pharmacology or mechanism of action of the compound 2. Toxicity linked to a defined secondary pharmacology of the compound 3. Toxicity linked to the presence of a structural fragment (toxicophore or structural alert) 4. Toxicity linked to the overall physicochemical properties of the molecule

fore these in vivo studies are not a practical solution for conducting structure-activity studies. As a result, the pharmaceutical industry is turning to the use of in vitro screening paradigms to help select compounds prior to the more labor intensive in vivo experiments. The number of in vitro assays available for use is growing rapidly and the strategies for screening compounds are continually evolving.[2] Recent publications have implicated that the relative selectivity of a compound for its primary pharmacology target is an important factor in determining the safety profile of a new drug.[3] The use of hit rate, a simple measure of how many protein targets a compound inhibits above a set threshold, typically set at greater than 50 % inhibition at a concentration of 10 mM, compared to the total number of protein targets that the compound is screened against, is just one way to look at relative promiscuity of a compound. Other methods have also been published that take a panel of responses in in vitro pharmacology assays and use these to generate a single score of selectivity or promiscuity.[4] The Gini coefficient measures statistical dispersion and is

The use of early in vivo toxicology testing to explore both pharmacology-related and compound-based adverse effects is a commonly applied strategy in drug discovery[1] but by our estimates the cost of running a two-week rodent study is approaching $100 000 and it can take several months to prepare sufficient quantities of chemical matter, perform the study and analyze the results. There-

[a] X. Wang, N. Greene Compound Safety Prediction Group, Worldwide Medicinal Chemistry, MS 8118-B3, Pfizer Inc. 445 Eastern Point Road, Groton, CT 06340, USA *e-mail: [email protected] Supporting Information for this article is available on the WWW under http://dx.doi.org/10.1002/minf.201100148.

The pharmaceutical industry is under increasing pressure to deliver new and safer drugs to market while constraining the research and development costs in getting them there. One of the primary causes of failure in the drug development process has been the manifestation of adverse drug reactions in the clinic resulting in the termination of the compound development or withdrawal from the market place. This late stage attrition is very costly and so the early identification of the potential for causing toxicity is highly desirable. Toxicity can be considered to have four fundamental origins:

Mol. Inf. 2012, 31, 145 – 159

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

145

Full Paper

X. Wang, N. Greene

scale-free with no units. It was originally used to measure inequality of income or wealth. On the assumption of inequality indicates selectivity, Graczyk[4b] applied the Gini coefficient to estimate the selectivity of kinase inhibitors in 85 kinase assays. Another method, Thermodynamics Partition Index,[4a] can be conceptually defined as a competition assay in which all targets of the assay panel are placed within a test tube and then compete for binding to a pool of inhibitor. At thermodynamics equilibrium, the fraction of bound inhibitor to a particular target is the partition index P for that target. However, there is still uncertainty about how many and which targets should be in the optimal panel for measuring the selectivity of a compound. In this study, we have used the CEREP BioPrint database[5] of marketed drug pharmacology profiles to look at the optimal number and constitution of a panel of assays to measure the selectivity/promiscuity of compounds across all targets so that this could be used as an indicator of potential toxicity. We then go on to compare and contrast different calculation methods for promiscuity and their relationship to toxicity using a Pfizer proprietary compound set that has both selectivity profiles and exploratory toxicology study results.

2 Materials and Methods 2.1 Compound Data Sets 2.1.1 Bioprint Profile of Marketed Drugs

Assay profiles of marketed drugs were extracted from CEREP’s BioPrint platform,[6] which includes a wide range of target classes including ADME-Tox related targets; peptide and non peptide GPCRs; various enzymes, transporters, nuclear receptors and ion channels. A subset of the Bioprint dataset was selected based on both compounds and assays having relatively complete profiles of results, i.e. each compound was tested in the majority of assays and each assay had results for the majority of compounds. This subset covers 633 drugs profiled in 185 assays with an average of only 2.2 % missing data per assay at a single concentration of 10 mM. Assay results are recorded as the percentage of inhibition when compared with positive and negative controls. The data was normalized so that negative inhibition values were replaced by 0 and inhibition values above 100 were replaced by 100. The lists of marketed drugs and assays can be found in supplemental data and the assay results for the compounds are commercially available from CEREP. 2.1.2 Pfizer Internal Compounds

Pfizer proprietary compounds with data from preclinical exploratory toxicology studies (ETS) were previously collected and have been used in previously published analyses.[7] All in vivo studies were performed in rats or dogs, with corre146

www.molinf.com

sponding summary pharmacokinetic exposure data (Cmax and AUC). The details of the studies and the compounds were described elsewhere by Hughes et al.[7c] In this exercise, 188 compounds were selected because they had also been profiled through the majority of the BioPrint assays. This subset of compounds were further divided into two groups based on whether or not there were in vivo findings at or below a Cmax of 10 mM (total drug): compounds with no findings at exposures greater than or equal to 10 mM were classed as clean@10mM, 61 compounds fell into this category; and compounds with in vivo findings at exposures less than 10 mM were classed as findings@10mM, 127 compounds fell into this category. In addition, BioPrint assays with more than 67 % of the data missing were removed from consideration, which resulted in 120 assays in common between this and the marketed drug dataset described earlier.

2.2 Measures of Promiscuity 2.2.1 Reversed Hit Rate (RHR)

The hit rate of a compound is defined as the percentage of assays where the test article produces greater than or equal to 50 % inhibition in the assay. The reversed hit rate is then 1 minus the hit rate. This is done primarily so that the value and directionality is consistent with all the other methods where larger scores indicate good selectivity and smaller scores indicate greater promiscuity. 2.2.2 Modified Gini Coefficient (Gini)

The assumption of measuring compound selectivity using the Gini coefficient is that “selectivity” is analogous to “inequality”. However, in this particular exercise we are looking at off-target activity of a compound with no regard for potency at its primary pharmacology. Therefore, simply looking at the compound’s BioPrint profile would potentially lead to some erroneous assignments of promiscuity. By way of illustration, compound A with all assay inhibition values equal to 10 % and compound B with all assay inhibition values equal to 80 % will have the same Gini coefficient. This is because the Gini coefficient is scale free; it only considers the relative strength among values used in the equation. Thus, for compounds that are inactive in majority of the assays so that the inhibition values are relatively equal to each other, Gini coefficients will be small and thus indicate promiscuity. Therefore, compound X with inhibition values: 0, 0, 0, 1, 2, 2, 3, 3, 4, 4, 5, 7, 8, 9 %, and 11 % in a 15 assay panel, has a Gini coefficient of 0.47 that suggests that it is somewhat promiscuous. However, assuming that X is a potent inhibitor against its primary pharmacology, compound X is obviously very selective since it is considered to be inactive in all other assays. This observation prompted the addition of an extra assay inhibition value set at 100 % to the BioPrint profile, on the

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Mol. Inf. 2012, 31, 145 – 159

Comparing Measures of Promiscuity and Exploring their Relationship to Toxicity

assumption that each compound will have a primary target it will inhibit it at 100 % at a concentration of 10 mM. Now compound X has 16 assays, the last being the “pseudo primary target” set at 100 % inhibition. The Gini coefficient for X becomes 0.73, a more appropriate value. The detailed study of the impact of this “pseudo primary target” on the Gini coefficient is discussed later. The Gini coefficient is calculated using the “Gini” function in “ineq” package (http://cran.r-project.org/web/packages/ ineq) in the R statistics software. 2.2.3 Thermodynamics Partition Index (PI)

Thermodynamics Partition Index is implemented in R as described in Cheng et al.[4a] Inhibition values are converted to IC50 values using the Hill equation.[8] We also assume each compound inhibits its primary target at 100 % at 10 mM, and this primary target is added as an extra assay for each compound. 2.2.4 Rescaled Geometric Mean (RGM)

For each compound, inhibition values greater than 70 % are given score 0.1; values below 30 % are given score 1; and those between 30 % and 70 % were given scores calculated using the formula: Score = 0.9  (70 inhibition)/40 + 0.1; missing values were assumed to be inactive and so given a score of 1. The RGM is the geometric mean of all scores for a compound across the assays.

0.9667 and 0.977, and we observed that a handful of compounds have very different selectivity scores depending on the method of calculation. Since missing data are unknown data, we believe removing missing values give a more realistic scores and that is the method we used throughout this work. We also require that each compound has at least one-third of the data to calculate valid selectivity score.

2.2.7 Predefined Assay Panels

–Top83 panel, as the name suggests, consisted of a subset of 83 assays from the broader BioPrint panel and were selected to maximize coverage yet minimize the overlap with assays run in-house at Pfizer conducted using different protocols, conditions or reagents and hence potentially confuse project teams with different results from seemingly the same endpoint –Top31 panel was made up of 31 targets that were felt to have clear links to physiological outcomes in humans –Top15 panel was selected by Pfizer Safety Pharmacology experts who voted on the 15 targets they would be most concerned about hitting from a safety pharmacology perspective. This set includes targets classes belonging to transporters (3), GPCRs (9), Ion channels (2), and enzymes (1). It should also be noted that the Top15 assay panel is a subset of the Top31 assay panel that is itself subset of the Top83.

2.2.5 Rescaled Standard Deviation (RSD)

2.2.8 Randomized Test and other R Functions Used

For each compound, the standard deviation (SD) of the inhibition values was calculated (missing values omitted), and then subtracted from the maximum standard deviation of all compounds. All SD values were divided by maximum standard deviation so that the RSD value is between 0 and 1, and larger values indicate good selectivity consistent with all other methods described here.

All randomized test and computations were done in R. “scatterplot” function in “car” package is used to draw most scatter plots in figures. Violin plots and density plots were plotted using “lattice” and “ggplot2” packages. The R script is available upon request.

3 Results 2.2.6 Dealing with Missing Data

3.1 Comparison of Data Sets Used in Analysis

For the marketed drug data set there is an average of 2.2 % missing data for each assay. Initially missing data was replaced by assay median as most compounds were inactive in each of the assays. However, in Pfizer internal ETS data set there is a greater amount of missing data with an average of 23 % missing data per assay. Therefore, replacing missing data with the assay median could have a much greater impact on the results of the analysis. We therefore investigated removing missing data before calculating selectivity score and checked the Pearson correlation between the two methods. For RGM and PI the two methods have a correlation of 1 and 0.99997, i.e. they are not affected by difference in the two calculation methods; however RSD, RHR and Gini have correlation coefficients of 0.985,

To satisfy ourselves that any conclusions drawn from one data set could be applied to the other, we checked the distributions of molecular weight, LogP and polar surface area across the two compound sets. We also clustered the compounds in each set to ensure that we were not introducing any bias to the subsequent analysis through the over-representation of one or more chemical classes. Clustering was performed using Daylight fingerprints and Ward’s clustering[9] with a Euclidean distance of 0.1 defining each cluster size. Figure 1 shows the combined data sets where black triangles denotes Pfizer ETS compounds and grey squares denotes the marketed drug data set Plot (A) shows the distribution of CLogP versus MW; Plot (B) shows the distribution of cLogP versus Polar Surface Area (PSA); Histogram

Mol. Inf. 2012, 31, 145 – 159

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

www.molinf.com

147

Full Paper

X. Wang, N. Greene

(C) shows the relative distribution of compounds across the molecular weight ranges and the height of the bar is the count of the compounds in that range; Histogram (D) shows the number of compounds in a cluster of the specified size. 3.2 Correlation of Selectivity Score Generated from Five Methods

The marketed drug dataset contains 633 compounds tested in 185 assays. Panel A of Figure 2 shows the density plot of inhibition values of all 185 assays. All assays have low inhibition percentages for the majority of compounds, and some assays have a second peak around 100 because all values greater than 100 % are converted to 100 %. The median hit rate, where a hit is considered to be greater than 50 % inhibition, is 2.4 % (mean at 5.7 %), while 88 assays have a hit rate less than 2 percent (i.e. less than 2 % of compounds are active against that target at 10mM). Panel B shows the compound-wise density plot, with a similar pattern to the assay-wise inhibition value distribution. Selectivity scores were generated using all five of the previously described methods: Reversed Hit Rate (RHR),

Gini coefficient (Gini), Partition Index (PI), Rescaled Geometric Mean (RGM), and Rescaled Standard Deviation (RSD) using all 185 assays. Figure 3 uses violin plot, which is similar to boxplot, except that it also shows the probability density of the data at different values. Panel A and B show the distribution of selectivity scores generated from the five methods. It is apparent that the selectivity scores based on all assays (A) have very different distributions among methods, as RHR, RGM, Gini show tight distributions with tails towards the promiscuous end and PI and RSD has very flat distribution. On the other hand, scores based on the Top15 assay panel (B) are more similar to each other and display a more flat distribution with peak at selective end. To see how well the selectivity scores from different methods correlates with each other, and since each method has a different distribution, Spearman correlation coefficient based on the rank of the selectivity score was used. Table 1 shows the correlation coefficients between the 5 methods using Top15 panel (in parenthesis) and Allassay panel for the marketed drugs. In general, the correlation between methods is consistent irrespective of the panel of assays used with the exception that Gini has a

Figure 1. Distribution of properties for the marketed drug data set (black) and the Pfizer ETS data set (gray).

148

www.molinf.com

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Mol. Inf. 2012, 31, 145 – 159

Comparing Measures of Promiscuity and Exploring their Relationship to Toxicity

Figure 2. Density plots of inhibition value distribution by assay (A) or by compound (B) in the marketed drug dataset. (A) Each density curve represents the inhibition value across compounds for each of the 185 assays. (B) Each density curve represents the inhibition value across assays for each of the 633 compounds.

Figure 3. Violin plots of selectivity scores for the five methods based on all assays (A) or top15 assay panel (B). Each density curve contains 633 data points.

much higher correlation to other methods when using the smaller Top15 assay panel rather than all assays suggesting that adding more assays when using Gini simply increases the noise when calculating selectivity. The other methods have slightly higher correlation using all assays than using top15 assay panel. While most compounds would be ranked similarly by each method, a detailed look at the rank of compounds shows that some compounds would have radically different rankings among five methods. Figure 4 shows four such compounds with very different rankings depending on the Mol. Inf. 2012, 31, 145 – 159

method when using all assays. The plot shows the inhibition profiles for these compounds: compound (a) is inactive Table 1. Spearman (rank) correlation of compound selectivity score between methods for All-assays and Top15 assay panel (in parenthesis). Method

RGM

Gini

PI

RSD

RHR RGM Gini PI

0.99 (0.96)

0.39 (0.77) 0.37 (0.79)

0.95 (0.91) 0.97 (0.94) 0.35 (0.85)

0.98 0.99 0.39 0.98

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

www.molinf.com

(0.91) (0.95) (0.80) (0.98)

149

Full Paper

X. Wang, N. Greene

in almost all assays; (b) has a bimodal distribution with almost no value in medium active range; (c) has an even distribution in the active assay ranges; and (d) has some assays in the medium active range but no assays in the highly active range. From Table 2 it can be seen that compounds (a) and (b) clearly show the difference between Gini and the other methods as Gini will assign a small value to a compound that is inactive in all assays because the values are more equal to each other when compared to the assay values for (b). This is clearly an incorrect assignment as compound (b) would be considered more promiscuous than compound (a). Compounds (c) and (d) also show large variances among the methods, especially between RHR and PI methods. Since (c) has more assays with values > 50 %, its RHR is smaller than (d). However, it is worth noting that if we change the threshold for defining a hit to 60 % then the RHR of (c) will increase significantly. On the other hand, PI is affected mostly by the highly active assays of the compound. Since (d) is more highly active in more assays than (c), the PI value of (d) is smaller than (c). Since each method uses a different algorithm to calculate the selectivity score, it is no surprise they will “favor” certain profiles over others. To confirm the correlation we observed above was not an artifact resulting from the specific assay panel, we did a randomized study using subsets of 15, 30, 60, and 90 assays selected randomly from 185 total assays, repeating each selection 10 000 times, calculating the corresponding selectivity scores for the 633 compounds and checking the correlation of the selectivity score across the different methods. The trends observed in selectivity score distributions using all assays and top15 assay panel were not changed. RHR almost perfectly correlates with RGM, followed by PI to RSD, RHR to RSD, PI to RHR, and Gini has the lowest correlation to all other methods. 3.3 Impact of Assay Selection on Selectivity Scores

In Graczyk’s work,[4b] he conducted randomized tests to show that different subsets of assays could result in very different Gini coefficients for individual compounds, and increasing the assay number above 85 has little impact on Gini coefficient. In our work, we are more interested in reducing the number of assays needed to estimate a selectivity score in order to reduce time and cost of compound

Figure 4. Density plot of the inhibition profiles for 4 compounds that have different selectivity score rankings from the five methods when using all assays.

profiling. Thus, subsets of 15, 30, 60, 90 assays were selected randomly from 185 total assays and repeated 10 000 times, to calculate the corresponding selectivity score for each of the 633 compounds. Two statistics were calculated for each run: 1. the Pearson correlation coefficient of the selectivity score to a “baseline score” which we set to the selectivity score calculated using all 185 assays by each method respectively to see how each subset of assays represents the overall assay profile. 2. the standard deviation of the selectivity score to measure its distribution, on the assumption that a larger standard deviation would be more likely to discriminate between compounds with different selectivity profiles Panel A of Figure 5 shows that when the assay number increases, the correlation coefficients increase and the variances in correlation coefficient decrease. This is expected as with smaller panels, different selections could generate

Table 2. The corresponding rank within 633 compounds for four compounds by each of the five methods with the actual selectivity score in parenthesis. Compound

RHR

RGM

Gini

PI

RSD

(a) (b) (c) (d)

558 (1) 113 (0.881) 421 (0.989) 189 (0.946)

594 107 366 236

73 (0.661) 575 (0.773) 388 (0.736) 196 (0.704)

545 (0.691) 67 (0.064) 262 (0.33) 342 (0.423)

604 (0.852) 90 (0.264) 316 (0.684) 248 (0.614)

150

www.molinf.com

(1) (0.77) (0.974) (0.936)

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Mol. Inf. 2012, 31, 145 – 159

Comparing Measures of Promiscuity and Exploring their Relationship to Toxicity

extreme combinations of most inactive or most active assay groups, resulted in large variance of selectivity score for each compound. When the number of assay increases, the inactive assays become dominant in the selection and the selectivity score of the compound from different assay combination became larger and converge with the baseline scores. The distributions clearly show that RHR and RGM are very stable across different assay panels, while Gini, PI and RSD are very sensitive to the selection of assays, especially when assay panels getting smaller. On the other hand, the distributions of standard deviation of the selectivity scores in panel B of Figure 6 show more diverse patterns. Although the variance in standard deviation always decreases with increasing assays, the trend of the distribution varies. Standard deviations of Gini and RSD become smaller with increasing number of assays. While the trends for RHR and RGM are similar to that observed for Gini and RSD, it is not as obvious as different

assay groups have a similar center. However, the standard deviation of PI gets larger with increasing number of assays with the exception of the 60 and 90 assay groups which have similar distributions. The trend observed for Gini, RSD, RHR, and RGM is consistent with the correlation distribution as with increasing number of assays used the score converges with the baseline score, thus the standard deviation actually gets smaller. The trend observed for PI is not easy to understand at first. PI can be thought as a competitive binding assay so that adding inactive targets have no impact on the PI of the primary target. For this reason, when increasing number of assays are used, it is more likely that highly competitive secondary targets will be included and lead to the decrease in the PI for the primary target, and thus causing a greater distributions in the scores with increasing numbers of assays.

Figure 5. Violin plots of the correlation coefficients to the baseline and standard deviation of selectivity scores using randomly selected assay groups of size 15, 30, 60, 90 assays for each of the 5 methods.

Mol. Inf. 2012, 31, 145 – 159

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

www.molinf.com

151

Full Paper

X. Wang, N. Greene

Figure 6. Violin plot of the classification performance of each method for the ETS data set using AUC values and the selectivity score from randomly selected assay groups of 15, 30, 60 and 90 assays. Each density plot contains 10 000 data points for an assay group.

3.4 Using Different Assay Panels to Predict ETS Outcome

From the preceding analysis, it is clear that each method for calculating the relative selectivity of a compound has different characteristics and will classify certain compounds as either selective or nonselective depending on the algorithm. This however does not tell us which algorithm, if any, help to select compounds that are deemed to be more desirable based on their in vivo safety profiles. To explore this we took the set of compounds that has been profiled in exploratory toxicology studies (ETS), described earlier, to compare the capability of each method to predict toxicity based on compound selectivity. The classification performance was then measured by AUC (area under curve)[10] derived from receiver operating characteristic curve where a random coin toss would give rise to an AUC of 0.5, while a perfect classification method has AUC of 1. Since different assay panel selections have an impact on the selectivity scores, we again did a randomized test by selecting random 15, 30, 60, and 90 assays to check how well the score from a selected assay panel could separate the two classes of ETS compounds, toxic or clean, using the AUC value for comparison. Figure 6 shows that all methods with the exception of Gini have a tighter distribution and better classification performance with increasing number of assays in the panel. Gini is different in that all distributions center at approximately the same point and certain small assay panels can actually achieve a better classification performance than any larger panel. For the other four methods, even though smaller assay panels have larger variation and in general have lower classification accuracy than larger panels, certain smaller panels can have the better 152

www.molinf.com

classification performance than larger ones for this set of compounds. All of these results indicate that a small assay panel can achieve the same or better classification performance than larger panels if selected correctly. To further explore the effects of assay panel selection on its ability to differentiate the toxic compounds from the clean ones in the ETS compound set, three manually selected sets of assays (Top15, Top31 and Top83 as described in methods) were used. A further 4 panels were also defined based on computational analysis of the inhibition profiles of the marketed drug set: –The Cluster15 panel was selected by conducting Kmeans clustering (k = 15) on the marketed drug data and selecting the assays with the highest hit rates in each cluster –The Correlation15 panel was selected from the randomization tests on 15 assays and was the 15 assay panel with the highest correlation to the baseline –The Diverse15 assay panel is a combination of the Top15 and Cluster15 assay panel. Based on hierarchical clustering as shown in Figure 7, a couple of the Top15 assays that were either in the same cluster or had a very low hit rate were replaced with assays that are either in unrepresented clusters or targeting at different gene families. As a result, the Diverse15 set includes one member of the P450 family and one kinase in addition to the molecular classes in the Top15 set –All assays panel as the name suggests consists of all assays with less than 50 % missing data

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Mol. Inf. 2012, 31, 145 – 159

Comparing Measures of Promiscuity and Exploring their Relationship to Toxicity

Table 3 shows the classification performance as measured by AUC from all five methods for the seven assay panels. Several observations from this table can be made: 1. Amongst the four 15-assay panels, the Top15 panel that was selected by Pfizer scientists has the best performance. This probably indicates selection of assays based on their pharmacology does indeed generate both meaningful and discriminating panels 2. With increasing number of assays in the panel, the performance does not increasing monotonically. The Top83 assay has the best performance, followed by All-assay and Top15. The Top31 panel, which is a subset of Top83 and includes all Top15 assays, has the worst performance. This again indicates using larger panels does not necessarily have higher discriminating power, especially when the assays included are noisy or without an understanding of the pharmacology 3. Among the five methods, PI and RGM are usually slightly better than RHR and RSD, and Gini is the worst method To get a better understanding of the classification performance, we looked at the distribution of the selectivity scores generated by the five methods. Panel A and B in Figure 8 show the distribution of selectivity scores for clean (left plot) and toxic (right plot) compounds based on Top15 panel and All-assays, respectively. The horizontal line is an empirical cutoff, which allows 10 % of the clean compounds

to be classified as with findings. The positive predictive value (PPV) and negative predictive value (NPV) is calculated at this cutoff and displayed in Table 3 in parenthesis. From the plots it can be seen that in general most clean compounds have larger selectivity scores, whereas toxic compounds have selectivity score distributed across the spectrum. All methods for calculating selectivity show bimodal distribution for the toxic compounds when applied to the Top15 assay panel. The two ends of the distribution most likely represent two sub-groups of toxic compounds: the true positives that are successfully predicted, and the false negatives whose binding pharmacologies that are not covered by the assay panel. When using the All-assay panel, the bi-modal distribution is less obvious for all methods except RSD, and more resemble a tailed normal distribution, which indicates there is no longer a clear distinction between false negatives and true positives as with the Top15 panel. The PPV and NPV value at this cutoff is well correlated with the overall AUC value as panels with high AUC value will also have high PPV/NPV, except that Gini has best PPV/NPV at one of the lesser AUC value. Another method we tested is using the “3/75” rule1 to classify compounds with clogP > 3 and PSA < 75A as toxic and the others as clean. This method has AUC value of 0.659 and PPV/NPV at 0.898/0.426, far worse than all five methods used here. The distribution of selectivity scores between clean and toxic compound is important as very different distributions

Figure 7. Hierarchical clustering of assays in marketed drug data. Each label starts with assay name, followed by the hit rate, assays in Top15 assay panel were marked with “ > ”.

Mol. Inf. 2012, 31, 145 – 159

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

www.molinf.com

153

Full Paper

X. Wang, N. Greene

Figure 8. Comparison of compound selectivity score distributions between two classes of ETS compounds using Top15 assay panel (A) and All-assays (B).

Table 3. Performance of the five selectivity methods in classifying ETS compounds using seven different assay panels by AUC value. Values in parenthesis are the PPV/NPV of the cutoff we used to classify real data. Assay Group

RHR

RGM

Gini

PI

RSD

Diverse15 Correlation15 Cluster15 Top15 Top31 Top83 All-assays

0.674 (0.895/0.695) 0.692 (0.897/0.708) 0.666 (0.875/0.677) 0.716 (0.897/0.703) 0.698 (0.903/0.708) 0.744 (0.923/0.765) 0.711 (0.918/0.743)

0.684 (0.898/0.7) 0.699 (0.905/0.722) 0.673 (0.887/0.689) 0.72 (0.891/0.695) 0.708 (0.902/0.705) 0.74 (0.923/0.765) 0.717 (0.92/0.749)

0.678 (0.897/0.697) 0.679 (0.887/0.695) 0.615 (0.812/0.639) 0.711 (0.891/0.695) 0.655 (0.872/0.669) 0.68 (0.833/0.65) 0.665 (0.842/0.65)

0.69 (0.887/0.684) 0.701 (0.878/0.684) 0.681 (0.889/0.692) 0.71 (0.893/0.697) 0.688 (0.887/0.684) 0.741 (0.908/0.725) 0.734 (0.908/0.719)

0.682 (0.898/0.7) 0.707 (0.898/0.711) 0.678 (0.88/0.682) 0.707 (0.895/0.7) 0.70 (0.9/0.703) 0.743 (0.921/0.758) 0.724 (0.921/0.752)

makes it easy to separate the two group of compounds, resulting in higher AUC values, while similar or close distributions result in low AUC values. When comparing the AUC values between the Top15 panel and All-assay panel (Table 3), Gini has a lower AUC in All-assay panel that is reflected in the overlapping distributions of the clean and toxic compounds. For RHR and RGM methods, even though in the All-assay panel the selectivity scores are getting 154

www.molinf.com

larger and the distribution is much tighter, the change is proportional between clean and toxic compounds and so results in almost the same AUC for both panels. For PI and RSD, both have higher AUC vales for the All-assay panel however the reasons for this are different. For PI, the difference in AUC is because selectivity scores for the toxic compounds decrease more than for clean compounds when using the All-assay panel whereas for RSD, the difference is

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Mol. Inf. 2012, 31, 145 – 159

Comparing Measures of Promiscuity and Exploring their Relationship to Toxicity

because selectivity scores for clean compounds increase more than for toxic compounds when using the All-assay panel. Although the correlation between methods is in general high, we are more concerned about the compounds that are differentially classified between methods. For our purposes, it is more important to avoid classifying ETS clean compounds as toxic i.e. reducing the number of false positives. Using the ROC curve, a cutoff value that maximizes the sum of precision and sensitivity can be found for each method, but this is not a practical value to use. In this study, we choose a selectivity score threshold for classification that only allows for a 10 % false positive rate, i.e. only 10 % of clean compounds are misclassified as toxic compounds, as the cutoff value. We then compared the classification of compounds between methods using only the Top15 assay panel. Figure 10 shows three comparisons, one between RGM and PI (A); another RGM and RHR (B); and lastly between RSD and Gini (C). The upper plot in each case shows the scatter plot of selectivity scores with the thresholds for classification for each method shown as the horizontal and vertical lines. Dots in lower left and upper right corners represent compounds classified the same by the two methods, while dots in upper left and lower right corners represent two groups of compounds classified only by one method as toxic. The bottom plots highlight inhibition profiles of the two groups of compounds (triangles vs. dots) that are classified differently. Each set of connected dots corresponds to the sorted inhibition values of one compound, and it should be noted that not all compounds have inhibition values for all 15 assay. For the majority of the ETS compounds, all 5 methods gave the same classification. However, compounds that border between being considered selective or promiscuous tend to be classified differently by each method. These compounds can be roughly divided into three groups: a. Compounds with inhibition values evenly distributed from low to high inhibition values b. Compounds with medium inhibition value in the majority of assays c. Compounds with low inhibition values in the majority of assays but have high inhibition value in a few of them The plots in Figure 9 show that type (a) compounds have lower selectivity scores using RHR and RGM than with other methods, type (b) compounds have lower selectivity scores using Gini, whereas type (c) compounds have lower selectivity scores using PI and RSD. Using the cutoffs determined from the ETS data set, we classified the marketed drug data set using Top15 panel. The percent of drugs that are classified as toxic are as follows: RHR at 32 %, RGM at 42 %, Gini at 30 %, PI at 11 %, and RSD at 39 %. These results are not entirely unexpected as many of the compounds in this drug set are designed to Mol. Inf. 2012, 31, 145 – 159

be active against some of the receptors in the assay panel used and do indeed have adverse events associated with them that can be considered quite severe in some cases. It is also worth noting that some of these compounds would probably not meet the current standards and expectations for approval by the FDA. From the results above it is clear that general promiscuity measures are only part of a potential solution for segregating compounds based on their in vivo safety profiles.

4 Discussion The main premise for this study was the question of whether a compound’s lack of selectivity across multiple receptors and gene families will lead to an increase risk of seeing toxicity. Simple logic would dictate that the more receptors a compound interacts with the more likely we are to see unwanted effects in a biological system. However, how to define what is considered to be a selective compound and the number and constitution of the receptor assays used to measure this is not a straightforward question to answer. In an attempt to address these questions, we have compared five methods for measuring a compound’s selectivity using Bioprint assays of varying numbers and selection. The 5 methods we have considered are the Gini coefficient (Gini), Partition Index (PI), Reversed Hit Rate (RHR), Rescaled Geometric Mean (RGM), and Rescaled Standard Deviation (RSD). All methods generate a score between 0 and 1 where a score of 1 means having the greatest selectivity and 0 meaning that the compound is most promiscuous. The Gini coefficient was originally used to measure income inequality and has more recently been used to measure kinase inhibitor selectivity on the assumption that inequality would equate to selectivity.[4b] In general this assumption works well for large panel of assays, but can be problematic when dealing with compounds that are inactive in the majority of the assays. The Gini coefficient does not differentiate between compounds that are highly active in all assays and those that have little or no activity in any assay. These two situations are clearly different in terms of their biological effects and so to correct for these situations, we modified the Gini coefficient algorithm by adding a “pseudo primary target assay” and set this to a value of 100 %. How well does this modified Gini coefficient deal with the issue of no activity being seen as a lack of selectivity? Assuming that a compound with a Gini coefficient less than 0.5 is considered to be promiscuous, compounds that have a hit rate of 0 % across the panel should be considered to be selective and so have Gini coefficient larger than 0.5. Using the Top15 assay panel, 145 compounds in ETS set have Gini coefficients less than 0.5 using the original method, After inspection it was found that 23 of the 145 compounds have a hit rate of 0 %. Using the modified Gini coefficient, only one compound has Gini coefficient of less

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

www.molinf.com

155

Full Paper

X. Wang, N. Greene

Figure 9. Comparisons of selectivity scores between two methods using Top15 assay panels.

than 0.5 and hit rate of 0 %, so this new Gini coefficient does indeed reduce the potential for false positives in this data set. Adding this pseudo assay clearly helps in differentiating between these two types of compounds and so for the purpose of this study, we always calculated the Gini coefficient using this modified method. The Gini coefficient also has some unique properties and is therefore the most dissimilar from the other methods for calculating selectivity scores. The correlation of scores from Gini to other methods decreased with increasing numbers of assays used in the calculation, while all other methods have better correlation as more assays are used. When classifying ETS data, Gini is the only method with the same performance with increasing number of assays used. One explanation for these observed trends is that Gini seems particularly sensitive to the inclusion of inactive assays in the panel. From this study we observed that with increasing numbers of assays, the number of inactive assays for any given compound increases and the resulting Gini coefficient becomes increasingly closer to those calculated for the rest compounds. The Rescaled Standard Deviation (RSD) method was studied because Gini coefficient is just one of method to measure statistical dispersion, standard deviation (SD) is also a measure of statistical variance or dispersion. Intuitively, pro156

www.molinf.com

miscuous compounds should have larger variations in their inhibition values across a panel of assays and selective compounds should have smaller variations. However, there is also a limitation in using SD. For example, compound X with all inhibition values at 10, and compound Y with all inhibition values at 90, will have the same original Gini coefficient of 0 indicating promiscuity, but the same SD at 0 indicating good selectivity. However, this limitation is not a problem in real data, as rarely are there compounds that are equally and strongly active against all receptors. While in the case of the Gini coefficient, it’s quite often a compound is inactive in most assays. The Thermodynamics Partition Index (PI) was recently published and used to analyze the selectivity of kinase inhibitors.[4a] It’s been shown to work with both small and large panels and can be used with both percent inhibition values and IC50 values. The drawback of using percent inhibition values is the low resolution at very low (< 5 %) and high (> 95 %) end. In using a percent inhibition values we make the assumption that each compound will hit its primary target at 100 % at the specified concentration, however if it also hits any assay at close to 100 %, the PI value will drop to around 0.5. In reality though the IC50 value for the primary target could be much smaller than that for the offtarget activity and so the impact of hitting this secondary

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Mol. Inf. 2012, 31, 145 – 159

Comparing Measures of Promiscuity and Exploring their Relationship to Toxicity

target would not cause the PI value to drop so dramatically. Because of the way PI is calculated, inactive assays close to 0 have no impact on the selectivity score at all, which is different from the other four methods, especially Gini. PI is highly correlated with RSD, and both methods identify those compounds with a bi-modal type of distribution with low inhibition values in most assays and high inhibition values in just a few as being promiscuous among those compounds considered having borderline promiscuity. Reversed Hit Rate (RHR) is a modified version of the simple hit rate method to yield a score between 1 and 0 where lower scores indicate lower selectivity. This simple hit rate method uses a single threshold to calculate the hit rate as a percentage of assays with inhibition values above certain cutoff. However, one potential issue with this approach is that using a single threshold, in this case 50 %, will put two compounds with inhibition values of 48 % and 52 % into different bins. In reality though these values are within the experimental error of the assay and so could potentially represent equivalent activities. Similarly two very different values such as 1 % and 45 % would be classed as being in the same bin. In order to see the impact of this issue we also looked at a modification to this method through the use of multiple thresholds and so dividing the inhibition values into more than 2 bins. For example, we used 30 % and 70 % as cutoffs so that values above 70 % score 2, between 30 % and 70 % score 1, and below 30 % score 0 in the hope that this would yield a more accurate hit rate. This modified method does show slightly better AUC value than the RHR method (data not shown) but not significantly so. In addition, attempts to divide value into more than 3 bins did not yield better classification performances. The Rescaled Geometric Mean (RGM) method works in a similar fashion to simply calculating the geometric mean of the inhibition values but with the main difference that values below 30 % are considered inactive and values above 70 % are considered active. When calculating geometric means active compounds were given a score 0.1 since a 0 score cannot be used for geometric means. Values between 70 % and 30 % were interpolated between 0.1 and 1 and values less than 30 % were given a value of 1 before taking the geometric mean. For comparison, we also looked at using either the geometric mean or arithmetic mean of the actual inhibition values for our classification experiments but discovered that RGM performs slightly better than both of these methods (data not shown). RGM and RHR methods are highly correlated and so both methods will identify compounds that have evenly distributed inhibition values from low to high as being promiscuous among those compounds considered as having borderline promiscuity. RGM, between the two methods, has the advantage of using loosely defined cutoffs. In fact, the 30 % and 70 % thresholds can be removed from the calculation method with minimum impact to the results.

Mol. Inf. 2012, 31, 145 – 159

With this study we have attempted to answer to the following more specific questions: 1. How well does a selectivity score calculated using a small assay panel correlate with scores using larger assay panels, and do smaller panels yield similar performance when classifying compounds compared to larger panels? The answer to this question can be derived from the randomized studies using sets of 15, 30, 60, and 90 assay panels. The correlation of selectivity scores using these subsets to a baseline score calculated using all available assays shows that in general selectivity scores calculated from small assay panels can be considered to be representative of those using broader assay panels (see Figure 5). The RHR, RGM, and PI methods all give rise to a significant proportion of the 10 000 random 15-assay panels where the correlation to the baseline is greater than 0.8. Similarly, all methods when using an assay panel size of 30 or more generate tightly distributed correlation values that are in general highly correlated to the baseline score. The randomized tests using the ETS data set to look at classification performance (see Figure 6) also show that the small panels can perform as well as larger ones, and certain small panels can even have better performance than all large panels. It could be argued that those small panels might be over-fitting, so that it only works on the ETS data set and therefore not likely to work on new compounds. Though over-fitting can be tested by cross-validation, it was not performed during this study. 2. How can we select the best subset of assays for use in a small assay panel and what is the optimal size? To address this question we used 7 different panels of assays and compared their performances in classifying the ETS data set (Table 3). The first panel chosen was the 15assay panel (Correlation15) whose selectivity scores were most highly correlated to the baseline score calculated using the all assays. The second panel was selected using cluster analysis to divide the assays into clusters based on their compound inhibition profiles for the marketed drug set (see Figure 7). We then chose the assay with the highest hit rate within each of the 15 clusters to give an assay panel (Cluster15) with the best coverage of the available assays. The third panel we looked at was the 15 assays that a group of Pfizer Safety Pharmacology experts chose (Top15) based on their concern for seeing a hit against the particular receptor. However, to address concerns around the over-representation of transporters and GPCRs in the panel, we also selected another panel of 15 assays (Diverse15) where we used the Top15 assays but replaced some of the assays with those from either different clusters or different gene families. Finally, we also looked at a panel of 31 assays (Top31) where it was felt that each target had a demonstrated link to a physiological outcome; a panel of 83 assays (Top83) which consisted of all assays that were not considered duplicative of internally run experiments. From the results shown in Table 3 It can be seen that all panels produced good results when classifying the ETS

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

www.molinf.com

157

Full Paper

X. Wang, N. Greene

compounds irrespective of the method for calculating selectivity scores. However, the Top15 panel of assays was consistently albeit modestly better than the other panels of 15 assays and was often comparable to the results seen using larger panels of assays. Given the desire to keep the cost of running experiments as low as possible while not sacrificing accuracy, it would seem that the Top15 assay panel would be the optimal size and constitution when trying to identify compounds with a higher likelihood of causing in vivo effects at low systemic exposures. 3. Which method for calculating selectivity scores works best? Based on the results of this study and the data sets we have, the order of classification performance in decreasing order is: RGM, PI, RHR, RSD, Gini. However, with perhaps the exception Gini, there is no significant difference in performance of the other four methods. The main issue with Gini is that it tends to have a high false positive rate when used to classify the ETS data set and so this makes it the worst method. It is our belief that Gini should only be used if there is a desire to classify the compounds with medium activity across the entire panel as being promiscuous. Also when considering off target effects, Gini needs to be modified as we did in this study as compounds with low activity across panel will also been classed as promiscuous by the unmodified algorithm. Among the remaining four methods, it can be seen that RHR and RGM work similarly, as do PI and RSD. However, RGM and PI have slightly higher AUC values than the other two when used to classify the ETS compounds, and they are almost unaffected by different methods for handling missing data points. Modification of the RHR method to use 3 bins as mentioned above can increase the performance of the RHR method such that it is better than PI. However, this ordering is susceptible to change if another data set is used for the comparisons.

benefit of having an understanding of their association to adverse effects. Small assay panels have comparable performance in measuring compound selectivity compared to large assay panels. In fact, based on the ETS data set, we have shown that a carefully selected, pharmacologically related assay panel has better prediction power for broad toxicity than by simply using all assays together. However, it is worth noting that no matter what assay panels are used there are always a large number of false negatives i.e. toxic compounds that do not show promiscuity in the assays. This is most likely due to the fact that toxicity can result from multiple mechanisms and not always as a result of the promiscuous nature of any given compound. Combining multiple readouts from different assays will therefore help to overcome this short fall from looking at single approaches. Work is underway in our group to identify secondary assay panels that can increase sensitivity and specificity across this set of compounds for predicting in vivo toxicity. In addition, efforts are ongoing to improve upon our classification system for toxicity as the current one is categorical and only indicative of presence or absence of effects at a specified concentration. It does not however, take into account the severity of the effects seen in vivo and the development of a system that can incorporate some measure of severity may ultimately lead to a more refined application of selectivity in predicting toxicity. In conclusion, this study shows that using small panels of binding assays can be used as surrogates for wider panels when looking at selectivity and so enables higher throughput with lower costs. It is our opinion that a panel of between 15 to 30 assays will most likely have the highest prediction power when correlating this to in vivo effects. Also, the promiscuity score is just one parameter we used to access compound safety. Several in vitro assay values and some physical-chemical properties are also used in a multiplex method to generate the final safety index.

5 Conclusions All five methods are useful in estimating compound selectivity/promiscuity against a panel of assays in high throughput fashion, however, each method has a bias towards certain compound profiles so that it will generate very different selectivity scores for those compounds when compared with other methods. Our study shows that RGM and PI have the slight edge over the other methods when trying to use these scoring methods to classifying our ETS compound set. Not all assays are useful in measuring a compound’s selectivity. Assays with low hit rates across the compound set will most likely just add noise to selectivity scores and may drown out any real signal. Although running a simulation study can find the assay combination with the best classification performance, the chance of over-fitting is great. Selection of assays by knowledge and experience gives rise to a panel with good performance and provides the added 158

www.molinf.com

Acknowledgements We would like to thank David Price and Eric Gifford for their helpful suggestions and comments during the course of the analysis work and the preparation of this manuscript.

References [1] a) I. Kola, J. Landis, Nat. Rev. Drug. Discov. 2004, 3, 711 – 715; b) J. A. Kramer, J. E. Sagartz, D. L. Morris, Nat. Rev. Drug. Discov. 2007, 6, 636 – 649. [2] a) N. Greene, R. Naven, Curr. Opin. Drug. Discov. Devel. 2009, 12, 90 – 97; b) J. L. Stevens, T. K. Baker, Drug. Discov. Today 2009, 14, 162 – 167. [3] a) Y. Yang, H. Chen, I. Nilsson, S. Muresan, O. Engkvist, J. Med. Chem. 2010; b) F. Milletti, A. Vulpetti, J. Chem. Inf. Model. 2010, 50, 1418 – 1431.

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Mol. Inf. 2012, 31, 145 – 159

Comparing Measures of Promiscuity and Exploring their Relationship to Toxicity [4] a) A. C. Cheng, J. Eksterowicz, S. Geuns-Meyer, Y. Sun, J. Med. Chem. 2010, 53, 4502 – 4510; b) P. P. Graczyk, J. Med. Chem. 2007, 50, 5773 – 5779. [5] A. F. Fliri, W. T. Loging, P. F. Thadeio, R. A. Volkmann, Proc. Natl. Acad. Sci. USA 2005, 102, 261 – 266. [6] C. M. Krejsa, D. Horvath, S. L. Rogalski, J. E. Penzotti, B. Mao, F. Barbosa, J. C. Migeon, Curr. Opin. Drug. Discov. Devel. 2003, 6, 470 – 480. [7] a) M. D. Aleo, J. Aubrecht, M. Banker, J. W. Benbow, D. Nettleton, The Toxicologist CD – An official Journal of the Society of Toxicology 2009, 108; b) N. Greene, M. D. Aleo, S. Louise-May, D. A. Price, Y. Will, Bioorg. Med. Chem. Lett. 2010, 20, 5308 – 5312; c) J. D. Hughes, J. Blagg, D. A. Price, S. Bailey, G. A. Decrescenzo, R. V. Devraj, E. Ellsworth, Y. M. Fobian, M. E. Gibbs,

Mol. Inf. 2012, 31, 145 – 159

R. W. Gilles, N. Greene, E. Huang, T. Krieger-Burke, J. Loesel, T. Wager, L. Whiteley, Y. Zhang, Bioorg. Med. Chem. Lett. 2008, 18, 4872 – 4875. [8] S. Goutelle, M. Maurin, F. Rougier, X. Barbaut, L. Bourguignon, M. Ducher, P. Maire, Fundam. Clin. Pharmacol. 2008, 22, 633 – 648. [9] a) R. D. Brown, Y. C. Martin, SAR QSAR Environ. Res. 1998, 8, 23 – 39; b) D. J. Wild, C. J. Blankley, J. Chem. Inf. Comput. Sci. 2000, 40, 155 – 162. [10] J. A. Hanley, B. J. McNeil, Radiology 1983, 148, 839 – 843.

 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim

Received: November 2, 2011 Accepted: December 21, 2011 Published online: February 1, 2012

www.molinf.com

159

Comparing Measures of Promiscuity and Exploring Their Relationship to Toxicity.

Recent research has focused on algorithms to derive numerical measures of selectivity based on panels of in vitro pharmacology assays so that one mole...
2MB Sizes 1 Downloads 8 Views