J Comput Aided Mol Des (2015) 29:595–608 DOI 10.1007/s10822-015-9852-5

Comparison of bioactive chemical space networks generated using substructure- and fingerprint-based measures of molecular similarity

Bijun Zhang1 • Martin Vogt1 • Gerald M. Maggiora2,3 • Jürgen Bajorath1

Received: 30 March 2015 / Accepted: 3 June 2015 / Published online: 7 June 2015
© Springer International Publishing Switzerland 2015

Abstract Chemical space networks (CSNs) have recently been introduced as a conceptual alternative to coordinate-based representations of chemical space. CSNs were initially designed as threshold networks using the Tanimoto coefficient as a continuous similarity measure. The analysis of CSNs generated from sets of bioactive compounds revealed that many statistical properties were strongly dependent on their edge density. While it was difficult to compare CSNs at pre-defined similarity threshold values, CSNs with constant edge density were directly comparable. In the current study, alternative CSN representations were constructed by applying the matched molecular pair (MMP) formalism as a substructure-based similarity criterion. For more than 150 compound activity classes, MMP-based CSNs (MMP-CSNs) were compared to corresponding threshold CSNs (THR-CSNs) at a constant edge density by applying different parameters from network science, measures of community structure distributions, and indicators of structure–activity relationship (SAR) information content. MMP-CSNs were found to be an attractive alternative to THR-CSNs, yielding low edge densities and well-resolved topologies. MMP-CSNs and corresponding THR-CSNs often had similar topology and closely corresponding community structures, although there was only limited overlap in similarity relationships. The homophily principle from network science was shown to affect MMP-CSNs and THR-CSNs in different ways, despite the presence of conserved topological features. Moreover, activity cliff distributions in alternative CSN designs markedly differed, which has important implications for SAR analysis.

Keywords Chemical space networks · Biologically relevant chemical space · Tanimoto similarity · Substructure-based similarity · Matched molecular pairs · Network topology · Community structures · Statistical measures · Structural-activity relationship information

Corresponding author: Jürgen Bajorath, [email protected]

1 Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, 53113 Bonn, Germany

2 BIO5 Institute, University of Arizona, 1657 East Helen Street, Tucson, AZ 85721, USA

3 Translational Genomics Research Institute, 445 North Fifth Street, Phoenix, AZ 85004, USA

Introduction

The concept of chemical space is much discussed in computational and medicinal chemistry, chemical biology, and other areas of the life sciences that utilize small molecules to probe biological activities [1]. In principle, chemical space is discrete and finite, given that there is a theoretical limit on the possible number of molecules that can be obtained. However, this theoretical number is so large (on the order of 10^60 possible small organic compounds) [2] that chemical space considerations mostly focus on small, relatively confined biologically relevant regions populated with active compounds [1]. However, although chemical space-related issues are often considered in the context of compound library design, biological screening, or structure–activity relationship (SAR) analysis, the concept often remains fairly elusive, without a clear view of how chemical space should be rationalized, represented, and navigated. Clearly, there is no currently


available universal and generally accepted representation of chemical space. More formal descriptions of chemical space have primarily originated from computational chemistry and chemical informatics including, first and foremost, coordinate-based representations of chemical space [3, 4]. Such representations refer to N-dimensional reference spaces generated from numerical chemical descriptors in which compounds are represented as feature vectors and assigned descriptor coordinates [3, 4]. Since each of the chosen descriptors adds a single dimension to the multi-dimensional space, such spaces are generally of very high dimension. Hence, some form of dimensionality reduction is typically applied to remove less important descriptors. However, this can lead to a significant loss of information that can compromise the ability of the reduced-dimension feature vectors to fully represent chemical spaces. Moreover, as coordinate-based representations are essentially continuous in nature they are not fully consistent with the inherently discrete nature of chemical spaces. As an alternative, coordinate-free representations of chemical space can be created by replacing feature vectors with pairwise compound similarities [4, 5], which are treated as elements of similarity matrices. As the values of these elements depend on the specific molecular representation and similarity function used in their computation [5], the corresponding chemical spaces are also dependent on them. However, this dependence also applies to coordinate-based reference spaces where pairwise distances are calculated as an inverse measure of compound similarity—the shorter the distance between two compounds, the more similar they are thought to be [4]. Hence, the lack of invariance of chemical spaces to the molecular representation and similarity function used in the construction is a well known problem that plagues any similarity-based application. 
While arrays of pairwise similarity relationships yield possible reference spaces, they are purely numerical in nature and difficult to analyze. Therefore, as another, graphically accessible form of coordinate-free chemical spaces, molecular networks can be considered in which compounds are represented as nodes (vertices) and pairwise similarity relationships as edges (connecting vertices) [4]. In addition to their graphical accessibility, an advantage of using networks to delineate chemical space is that the resulting representations can be characterized by statistical methods originating from interdisciplinary work in network science [6–8]. This still-evolving field covers network representations from all areas including the life sciences, social sciences and communities, or the Internet [8–10]. Network science goes beyond purely mathematical analysis and attempts to elucidate characteristic properties and latent structural features of networks, regardless of their origin and regardless of whether they represent different functional or social relationships. Hence, network science aims to


deduce characteristics and design features that apply generally to network representations in different contexts. A good example is provided by the principle of ‘‘homophily’’ [11], which states that nodes having similar attributes are more likely to be connected to each other than to nodes with other characteristics. This tendency implies that many relationships captured in networks of different origin are generally based upon evident or hidden commonalities between different objects [11]. The development of networks for chemical informatics is still at an early stage. Thus far, only a handful of studies have attempted to describe molecular relationships with the aid of network representations [12–16]. Among these has been the recent introduction of chemical space networks (CSNs) [4] as an alternative to coordinate-based representations of chemical space [16]. CSNs were initially generated as threshold networks, where edges are drawn between pairs of vertices if the similarity of the corresponding molecules, computed using molecular representations based on fingerprint descriptors [17] and the Tanimoto similarity function [5], meets or exceeds the threshold value. Threshold CSNs (THR-CSNs) were characterized using methods from network science. The homophily principle was found to play a major role in CSN topologies, consistent with the presence of numerous ‘‘community structures’’ found in different CSNs. CSNs provide a very intuitive access to biologically relevant chemical space and make it possible to graphically analyze SAR information on the basis of similarity relationships and network structures. In addition, CSNs immediately reveal local SAR regions of different character (e.g. representing continuous or discontinuous SARs) as compound clusters. Accordingly, subsets of compounds that are SAR-informative and provide starting points for further chemical exploration or optimization can be directly selected from CSNs. 
Furthermore, global structural relationship and SAR views captured in CSNs can be compared for different data sets. CSNs were difficult to compare at pre-defined similarity threshold values due to the presence of compound class-specific similarity value distributions [16]. However, those with constant edge density, given by the ratio of the number of edges divided by the total number of possible edges, were straightforward to compare. At high edge density levels, THR-CSN topologies were not well defined and difficult to resolve. By contrast, at low-density levels, the modularity of CSNs representing classes of bioactive compounds was generally high and specific compound community structures emerged. Not surprisingly, THR-CSNs from randomly collected synthetic molecules were found to have much lower modularity [16]. Continuous similarity measures leading to threshold networks provide one possible, albeit the most straightforward,


way to account for molecular similarity as a basis for the construction of CSNs. Another way is afforded by substructure-based similarity approaches [5, 18] that provide a binary rather than continuous read-out of similarity values, since in the former case two compounds are considered to be similar if they share a pre-defined structural fragment and not similar otherwise. At first glance, this might appear to be a rather coarse way of accounting for similarity. Substructure-based similarity is, however, often more intuitive and easier to interpret in chemical terms than Tanimoto similarity [18], a characteristic that in many instances is considered to be an advantage in medicinal chemistry, although this is not necessarily the case in chemical informatics given the popularity of molecular fingerprints and Tanimoto similarity. For example, in medicinal chemistry, in the study of activity cliffs, defined as pairs or groups of structurally similar compounds having large potency differences [18], substructure-based similarity assessment has provided a substantial advance in rationalizing cliffs and associated SAR information [18–20]. Hence, in CSNs substructure-based similarity should be considered alongside continuous similarity measures, especially for charting biologically relevant chemical space where ease of interpretation is a major issue for SAR analysis.

The concept of matched molecular pairs (MMPs) [21] provides an elegant way of assessing substructure-based similarity relationships. An MMP is defined as a pair of compounds that are only distinguished by a chemical change at a single site [21], i.e., a substructure exchange, which is often termed a chemical transformation [22]. Transformation size-restricted MMPs have been introduced, in which substructure exchanges are limited to small chemical modifications typically observed in analog series [19].
The calculation of such MMPs is thought to represent a meaningful substructure-based similarity criterion for CSN design. As discussed above, generation of CSNs with constant edge density [16] provides an opportunity to compare CSNs based upon different types of similarity measures. In this study, CSNs were generated for 154 compound classes with high-confidence activity data using the presence of MMP relationships as a similarity criterion. These MMP-based CSNs (MMP-CSNs) were then compared in detail to corresponding THR-CSNs with constant edge densities obtained by appropriately adjusting their similarity threshold values. The results of the analysis are reported herein.


Materials and methods

Compound activity classes

Target-specific compound activity classes with at least 100 compounds were collected from the latest version of ChEMBL (release 20) [23]. Several selection criteria were applied to ensure high data confidence. Only compounds with direct interactions (i.e., assay relationship type “D”) with human targets at the highest confidence level (i.e., assay confidence score 9) were selected. Assay-independent and explicitly defined equilibrium constants (Ki values) were considered as potency annotations (approximate measurements such as “>”, “<”, and “~” were disregarded). Compounds with multiple Ki measurements for the same target were retained (and averaged) if all of the values fell within the same order of magnitude. On the basis of these criteria, 154 activity classes (for human targets) comprising a total of 76,479 compounds were obtained.

Network generation

For each activity class, transformation size-restricted MMPs were calculated as described [19]. In transformation size-restricted MMPs, core structures were required to have at least twice the size of exchanged substituents, the difference in size of exchanged fragments was limited to at most eight non-hydrogen atoms, and the maximal size of an exchanged fragment was set to 13 non-hydrogen atoms [19]. MMP-CSNs were then systematically generated in which edges between nodes (compounds) indicated pairwise MMP relationships. For each MMP-CSN, a corresponding THR-CSN was computed using the extended connectivity fingerprint with bond diameter 4 (ECFP4) [17], a widely used topological fingerprint capturing layered atom environments. The Tanimoto coefficient (Tc) threshold value was systematically varied to obtain a CSN having the same edge density as the MMP-CSN. All CSN representations were generated and analyzed using in-house Java software and the Java universal network/graph framework (JUNG) [24].

The layout of CSNs was derived by applying the Fruchterman–Reingold algorithm [25] that combines similar objects into clusters and separates clusters from each other in a force-directed manner for visualization (hence, inter-cluster distances in CSNs have no chemical relevance). This layout algorithm has previously been applied to generate network-like similarity graphs [12]. Nodes in CSNs were color-coded by potency using a continuous color spectrum from red (lowest potency in a data set) over yellow (intermediate) to green (highest potency).

Network comparison

For each MMP-CSN, its edge density, defined as the number of edges observed in a CSN divided by the number of possible edges between all pairs of nodes, was determined and the Tc threshold value of the corresponding


THR-CSN was adjusted to yield the same edge density. The MMP-CSN and corresponding THR-CSN were compared using the layout obtained for the MMP-CSN. If required, node positions in a THR-CSN, which were computed retaining the node positions of the MMP-CSN, were individually modified to adjust edge lengths for clarity.

Network properties

Three parameters from network science were applied to characterize and compare MMP-CSNs and THR-CSNs at constant edge density. Mathematically, a network can be described by a tuple $N = (V, E)$ consisting of a set $V$ of $n$ vertices $v_i$, $i = 1, \ldots, n$, and a set $E$ of $m$ undirected edges $e_{ij} = (v_i, v_j)$. The edge information of a network is conveniently represented by an adjacency matrix $A = (a_{ij})_{i,j = 1 \ldots n}$, where $a_{ij} = 1$ if there is an edge between $v_i$ and $v_j$ and 0 otherwise.

Degree assortativity

Degree assortativity of a network is defined as the correlation coefficient between the degrees of connected nodes:

$$r = \frac{\sum_{1 \le i,j \le n} \left( a_{ij} - \frac{k_i k_j}{2m} \right) k_i k_j}{\sum_{1 \le i,j \le n} \left( k_i \delta_{ij} - \frac{k_i k_j}{2m} \right) k_i k_j}$$

where $k_i$ is the degree of node $v_i$ and $\delta_{ij}$ is the Kronecker delta function [8]. Assortativity can be rationalized as the tendency of nodes to preferentially connect to other nodes that have similar features. Assortativity is directly related to the homophily principle. In homophilic networks, assortativity is generally high [26, 27]. Degree assortativity falls into the range [-1, 1]. In networks with negative assortativity values, edges tend to connect high-degree vertices to low-degree vertices, whereas in networks with positive values edges tend to connect vertices of similar degrees. A negative assortativity value is indicative of the presence of hubs (i.e., low-degree vertices grouped around central high-degree vertices), whereas a positive value indicates a more homogeneous topology and is consistent with networks forming densely connected clusters of various sizes.
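The definition of degree assortativity given above can be sketched in a few lines of Python. This is only an illustration, not the in-house Java/JUNG implementation used in the study; the function name `degree_assortativity` and the list-of-lists adjacency matrix are assumptions made for this example. The sketch uses the fact that the double sums reduce to simple degree moments, and it multiplies numerator and denominator by 2m so that all intermediate values stay integers.

```python
def degree_assortativity(adj):
    """Degree assortativity r of an undirected network.

    adj is a symmetric 0/1 adjacency matrix (list of lists) without
    self-loops; this input format is an assumption of the sketch.
    """
    n = len(adj)
    degrees = [sum(row) for row in adj]          # k_i
    two_m = sum(degrees)                         # 2m (each edge counted twice)

    # sum_ij a_ij * k_i * k_j
    edge_term = sum(adj[i][j] * degrees[i] * degrees[j]
                    for i in range(n) for j in range(n))
    # sum_ij k_i * delta_ij * k_i * k_j  reduces to  sum_i k_i^3
    sum_k3 = sum(k ** 3 for k in degrees)
    # sum_ij (k_i * k_j)^2  reduces to  (sum_i k_i^2)^2
    sum_k2 = sum(k * k for k in degrees)

    # Numerator and denominator of r, both multiplied by 2m
    numerator = two_m * edge_term - sum_k2 ** 2
    denominator = two_m * sum_k3 - sum_k2 ** 2
    if denominator == 0:
        # All nodes have the same degree (or there are no edges):
        # the correlation is undefined.
        return float("nan")
    return numerator / denominator


# Star network: one hub (node 0) connected to three leaves
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(degree_assortativity(star))
```

A star network is the extreme case of the hub topology mentioned above, since every edge connects the high-degree hub to a low-degree leaf, so the value lies at the negative end of the range.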

Clustering coefficient

The clustering coefficient accounts for the tendency of two nodes to be connected if they share a neighbor. The global clustering coefficient of a network is given by the average of clustering coefficients $C_i$ of all nodes [8]:

$$C_i = \frac{2\,|E(C_i)|}{k_i (k_i - 1)}$$

where $E(C_i) = \{ e_{jk} \in E \mid 1 \le j, k \le n \text{ and } e_{ij} \in E \text{ and } e_{ik} \in E \}$ is the edge set of the subnetwork $C_i$ of $N$ induced by the neighbors of $v_i$. In threshold networks, the global clustering coefficient generally increases with increasing threshold values [16]. The clustering coefficient falls into the range [0, 1]. In MMP-CSNs and THR-CSNs, in which edges represent similarity relationships between compounds, the clustering coefficient quantifies the probability that two compounds similar to a third one will also be similar to each other. The coefficient indicates how strongly compounds within network clusters are connected to each other.

Modularity

Modularity quantifies the extent to which a classification of nodes into a number of communities (clusters) is reflected by network topology. It is given by

$$Q = \frac{1}{2m} \sum_{1 \le i,j \le n} \left( a_{ij} - \frac{k_i k_j}{2m} \right) \delta(v_i, v_j)$$

where $\delta$ is a classification function: $\delta(v_i, v_j) = 1$ if nodes $v_i$ and $v_j$ belong to the same community (cluster) and $\delta(v_i, v_j) = 0$ otherwise. The value of $Q$ falls into the range [-0.5, 1). In order to detect community structures on the basis of network topology, the equation above must be optimized with respect to the classification function $\delta$, which represents a hard optimization problem [8]. However, heuristic optimization methods have been introduced including the algorithm introduced by Newman [28] that was applied in our study. Optimization algorithms will usually yield at least moderately positive modularity values by identifying an appropriate clustering. Negative values only occur when edges are more frequent between clusters than within clusters. This will, for instance, be the case for the partitions of bipartite networks.

Comparison of cluster distributions

The distribution of clusters, or “clustering”, generated by the Newman algorithm [28] was compared for MMP-CSNs and THR-CSNs using a normalized mutual information measure [29, 30]. A clustering is a partition of the nodes of a network into non-overlapping subsets

$$U = \{U_1, U_2, \ldots, U_k\}$$

of sizes $a_i = |U_i|$, $i = 1, \ldots, k$, where $\sum_{i=1}^{k} a_i = n$ is the total number of network nodes. The information content of a clustering is quantified by its Shannon entropy [29]

$$H(U) = -\sum_{i=1}^{k} \frac{a_i}{n} \log \frac{a_i}{n}$$

Given two clusterings $U$ and $V$ of a network separated into $k$ and $l$ subsets, respectively, mutual information quantifies the amount of information that is shared between the two cluster distributions. It can be evaluated on the basis of the contingency table $M = (m_{ij})_{i = 1 \ldots k;\ j = 1 \ldots l}$, where $m_{ij} = |U_i \cap V_j|$ denotes the number of nodes $U_i$ and $V_j$ have in common. The mutual information is then given by

$$MI(U, V) = H(U) + H(V) - H(U, V)$$

where $H(U, V)$ is the Shannon entropy of the contingency table defined by

$$H(U, V) = -\sum_{i=1}^{k} \sum_{j=1}^{l} \frac{m_{ij}}{n} \log \frac{m_{ij}}{n}.$$

The mutual information values fall into the range 0 to $\min(H(U), H(V))$. In order to compare the similarities of different pairs of clusterings, a normalization procedure is applied that yields scores in the range of 0–1. As suggested [30], $\max(H(U), H(V))$ was used for normalization, yielding

$$NMI(U, V) = \frac{MI(U, V)}{\max(H(U), H(V))}.$$
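The entropy, mutual information, and normalization steps above can be traced with a short Python sketch. It assumes that each clustering is given as a list of cluster labels, one per node and in the same node order for both clusterings; the function names and this input format are illustrative choices, not part of the original study. Empty cells of the contingency table are skipped, following the usual convention that 0 log 0 = 0.

```python
import math
from collections import Counter


def shannon_entropy(counts, n):
    """H = -sum (c/n) * log(c/n) over non-empty cells."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def normalized_mutual_information(labels_u, labels_v):
    """NMI(U, V) = MI(U, V) / max(H(U), H(V)).

    labels_u[i] and labels_v[i] are the cluster labels of node i in
    clusterings U and V (assumed input format for this sketch).
    """
    n = len(labels_u)
    if n == 0 or n != len(labels_v):
        raise ValueError("both clusterings must label the same non-empty node set")

    h_u = shannon_entropy(Counter(labels_u).values(), n)
    h_v = shannon_entropy(Counter(labels_v).values(), n)
    # Cells of the contingency table: m_ij = |U_i intersect V_j|
    h_uv = shannon_entropy(Counter(zip(labels_u, labels_v)).values(), n)

    mutual_information = h_u + h_v - h_uv
    normalizer = max(h_u, h_v)
    if normalizer == 0:
        # Both clusterings consist of a single cluster; returning 1.0 here
        # is a convention chosen for this sketch, not taken from the text.
        return 1.0
    return mutual_information / normalizer


# Same grouping with different label names, and an unrelated grouping
same = normalized_mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(same, unrelated)
```

Because only the co-occurrence of labels matters, the score does not depend on how clusters are named, which is what makes it suitable for comparing communities detected independently in two networks.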

Degree of edge overlap

The degree of edge overlap (D) provides an alternative method for assessing the similarity of two CSNs based on the same set of compounds but different schemes for allocating edges. Formally, for two networks A and B with identical vertex sets and edge sets $E_A$ and $E_B$, respectively, D is defined as the ratio of the edges common to both networks relative to the total number of edges

$$D = \frac{|E_A \cap E_B|}{|E_A \cup E_B|}$$

and thus $0 \le D \le 1$.

Structure–activity relationship information

In addition to analyzing network properties and clusterings, the distribution of activity cliffs was calculated and


compared for MMP-CSNs and THR-CSNs. As stated above, pairs or groups of structurally similar or analogous compounds sharing the same activity but having large differences in potency form activity cliffs. Accordingly, given their “small structural change, large potency effect” characteristic [18], activity cliffs are regarded as indicators of SAR information content [18, 20], which is highly relevant for bioactive CSNs [16]. Activity cliffs were calculated for compound pairs connected by an edge. Such compound pairs were required to meet the MMP-CSN or THR-CSN similarity criterion and have at least a 10-fold difference in potency. The activity cliffs defined in this way were then partitioned into three potency difference ranges, comprising pairs with at least 10-fold, 100-fold, or 1000-fold potency differences.
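The edge-based activity cliff criterion described above can be sketched as follows. The sketch assumes that potency is stored as pKi (the negative decadic logarithm of Ki), so that a 10-fold potency difference corresponds to one pKi unit, and it assumes that the three ranges are counted cumulatively (a 1000-fold cliff also counts as a 100-fold and a 10-fold cliff). The function name, the compound identifiers, and the small tolerance `eps` are hypothetical choices for this example.

```python
def count_activity_cliffs(edges, pki, eps=1e-9):
    """Count edges that qualify as activity cliffs at three levels.

    edges: iterable of (compound_a, compound_b) pairs connected in a CSN
    pki:   dict mapping compound id -> pKi value (assumed potency scale)
    eps:   tolerance for floating-point differences such as 8.7 - 7.7
    """
    # fold difference -> required difference in pKi (log10) units
    levels = {10: 1.0, 100: 2.0, 1000: 3.0}
    counts = {fold: 0 for fold in levels}
    for a, b in edges:
        delta = abs(pki[a] - pki[b])
        for fold, log_units in levels.items():
            if delta + eps >= log_units:
                counts[fold] += 1
    return counts


# Toy data: four compounds, four similarity edges
pki_values = {"c1": 6.0, "c2": 7.5, "c3": 9.2, "c4": 8.8}
csn_edges = [("c1", "c2"), ("c2", "c3"), ("c1", "c3"), ("c3", "c4")]
print(count_activity_cliffs(csn_edges, pki_values))
```

Applying the same function to the edge list of an MMP-CSN and to that of the corresponding THR-CSN would give the two cliff distributions that are compared in the analysis.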

Results and discussion

Study motivation and concept

How molecular similarity should be assessed is a central issue in the construction of CSNs. As originally introduced, a similarity threshold value must be pre-defined for THR-CSNs; hence, they are dependent on the type of molecular representation and similarity function (i.e., similarity measure) used. By contrast, the use of substructure-based similarity might simplify CSN construction and make it less subjective, thus alleviating the need to establish suitable similarity thresholds. However, application of substructure-based similarity criteria inevitably leads to a binary classification (i.e., similar vs. not-similar). In such cases, the CSNs offer no opportunities for edge density adjustments and topology modification, which are principal strengths of THR-CSNs. Thus, it was an open question whether CSNs generated on the basis of substructure relationships would provide meaningful representations of biologically relevant chemical space and, if so, how CSNs based upon alternative similarity measures might compare.

In order to explore alternative CSNs, transformation size-restricted MMPs were generated to establish substructure-based similarity relationships, in analogy to their use in activity cliff assessment [18, 19]. The generation of such MMPs limits substructure relationships to analogs with a common core structure [19] and represents a conservative approach to substructure-based similarity assessment. Applying this preferred substructure similarity criterion, MMP-CSNs for 154 qualifying activity classes were systematically generated and their edge densities determined. Then, for each MMP-CSN, a corresponding THR-CSN was generated by individually adjusting similarity threshold values such that MMP-CSNs and THR-CSNs of each activity class had the same edge density. At


constant edge density level, corresponding CSN representations were then systematically analyzed and compared, as described in the following.

Edge densities and graphical comparison of MMP-CSNs and THR-CSNs

Figure 1 shows exemplary MMP-CSNs for three activity classes of different size and composition and the corresponding THR-CSNs. The activity classes included inhibitors of matrix metalloproteinase 13 (Fig. 1a) as well as antagonists of orexin receptor 2 (Fig. 1b) and the bradykinin B1 receptors (Fig. 1c). The MMP-CSNs of these classes had relatively low edge density of 4.6 % (Fig. 1a), 5.2 % (1b), and 2.5 % (1c), which is desirable for CSN analysis and comparison [16]. Low edge density in MMP-CSNs was a direct consequence of the conservative assessment of substructure relationships on the basis of transformation size-restricted MMPs. Figure 1d shows exemplary compound structures.

Figure 2a reports the global distribution of edge densities obtained for MMP-CSNs across all activity classes and confirms that the observations made in Fig. 1 were representative. Most MMP-CSNs had edge densities below 8 % and densities above 10 % were only observed for a few activity classes of small size. Moreover, for activity classes of increasing size, edge densities generally decreased. MMP-CSNs of activity classes with more than 500 compounds displayed edge densities of at most 2 %. Overall, 44 % of all MMP-CSNs had an edge density lower than 2.5 %. Figure 2b reports the global distribution of average node degrees over activity classes of increasing size that displayed no obvious trend. These findings reinforced the formation of transformation size-restricted MMPs as a criterion for substructure-based similarity relationships and were further corroborated by graphical analysis of MMP-CSNs. As shown in Fig. 1, MMP-CSNs with low edge density had generally well-resolved topologies with clear community structures in which multiple compounds forming MMPs with the same core were represented by cliques.

Figure 1 also shows a side-by-side comparison of MMP-CSNs and corresponding THR-CSNs adjusted to have the same edge densities. In order to enable direct graphical comparison, the Fruchterman–Reingold layout obtained for the MMP-CSNs was used to display the corresponding THR-CSNs. Accordingly, if the overall distribution of similarity relationships would significantly differ in MMP-CSNs and THR-CSNs, the layout would be distorted by the presence of many “long” edges (indicating distinct similarity relationships in threshold MMPs), which could not be easily adjusted. However, this was generally not the case. Although there were locally confined differences in edge density between MMP-CSNs and THR-CSNs and also


distinct similarity relationships, as illustrated in Fig. 1, community structures were overall comparable in MMP-CSNs and corresponding THR-CSNs. This was the case, although the edge overlap between the two corresponding CSNs in Fig. 1 was on average only ~70 %, as further discussed below. The edge overlap of the exemplary CSNs in Fig. 1 is reported in Table 1.

In Fig. 1, ECFP4 Tc values are also reported for THR-CSNs that matched the edge density of their MMP-CSN counterparts. These values ranged from 0.58 (Fig. 1b) and 0.61 (Fig. 1a) to 0.63 (Fig. 1c). THR-CSNs were originally explored at a pre-defined ECFP4 similarity threshold value of 0.55, which is typically indicative of close structural relationships [5]. However, the examples in Fig. 1 illustrate that Tc threshold values greater than 0.55 were often required to adjust the edge density of THR-CSNs. This means that THR-CSNs generated at a pre-defined ECFP4 Tc of 0.55 had higher edge densities than corresponding MMP-CSNs such that the threshold value needed to be further increased for density adjustment.

Figure 3 shows the global distribution of Tc values for edge density-adjusted threshold networks and reveals that the observations made in Fig. 1 were of general nature. ECFP4 Tc thresholds ranged from 0.44 to 0.83. For activity classes of small size, significant variations of Tc threshold values were observed in the range of ~0.5 to ~0.7. For activity classes comprising more than 500 compounds, Tc threshold values fluctuated around 0.6. Hence, on the basis of these observations, one can conclude that a pre-defined ECFP4 Tc threshold value of 0.55 would be a reasonable choice for THR-CSNs in many cases, but that corresponding MMP-CSNs often yielded even lower edge densities.
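The density adjustment and the edge overlap discussed above can be illustrated with a small Python sketch. It is not the procedure used in the study, where the Tc threshold was systematically varied with in-house Java software; here the threshold is instead taken directly as the m-th largest pairwise Tc value, where m is the number of MMP edges, which is an assumption made for brevity. Pairwise Tc values are assumed to be precomputed and stored in a dictionary keyed by node pairs (i, j) with i < j, and all function names are hypothetical. With tied Tc values the threshold network can contain slightly more than m edges.

```python
def edge_density(num_edges, num_nodes):
    """Observed edges divided by the number of possible node pairs."""
    possible = num_nodes * (num_nodes - 1) // 2
    return num_edges / possible


def density_matched_threshold(similarity, target_edges):
    """Tc value at which the threshold network has target_edges edges.

    similarity: dict mapping (i, j) with i < j to a precomputed Tc value.
    """
    ranked = sorted(similarity.values(), reverse=True)
    if not 1 <= target_edges <= len(ranked):
        raise ValueError("target_edges must be between 1 and the number of pairs")
    return ranked[target_edges - 1]


def threshold_edges(similarity, threshold):
    """Edges of the threshold network: pairs that meet or exceed the threshold."""
    return {pair for pair, tc in similarity.items() if tc >= threshold}


def edge_overlap(edges_a, edges_b):
    """Degree of edge overlap D = |E_A & E_B| / |E_A | E_B|."""
    union = edges_a | edges_b
    if not union:
        raise ValueError("D is undefined when both networks have no edges")
    return len(edges_a & edges_b) / len(union)


# Toy example with four compounds (all values are made up)
tc_values = {
    (0, 1): 0.80, (0, 2): 0.62, (0, 3): 0.30,
    (1, 2): 0.55, (1, 3): 0.25, (2, 3): 0.70,
}
mmp_edges = {(0, 1), (1, 2)}

threshold = density_matched_threshold(tc_values, len(mmp_edges))
thr_edges = threshold_edges(tc_values, threshold)
print(edge_density(len(mmp_edges), 4), threshold, edge_overlap(mmp_edges, thr_edges))
```

In this toy case both networks have the same edge density, yet they share only one of three distinct edges, which mirrors the observation that density-matched networks need not contain the same similarity relationships.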
Global statistical network properties

Chemical space networks were also statistically analyzed using three parameters taken from network science, namely, the degree assortativity (DA), clustering coefficients (CC), and modularity (M), which were used to systematically compare MMP-CSNs and THR-CSNs. The values of these parameters are reported in Table 1 for the CSNs in Fig. 1, which correspond to the three exemplary activity classes described earlier. It should be noted that the parameter settings are independent of the graphical layout employed. For all three activity classes, MMP-CSNs and THR-CSNs had similarly high DA values, which reflected the presence of clear community structures, as well as similar M values. By contrast, CSNs of two of these classes (matrix metalloproteinase 13 inhibitors and orexin receptor 2 antagonists) displayed larger differences in CC values.

Fig. 1 Alternative chemical space networks. Shown is a side-by-side comparison of MMP-CSNs and THR-CSNs for three different activity classes including inhibitors/antagonists of a matrix metalloproteinase 13, b orexin receptor 2, and c bradykinin B1 receptor. In CSNs, nodes represent compounds and are colored according to potency using a continuous color spectrum from red (lowest potency in the class) over yellow (intermediate potency) to green (highest potency). Edges in MMP-CSNs indicate substructure (MMP) relationships. In THR-CSNs, edges are drawn between pairs of compounds exceeding the Tc threshold value (lower right) yielding the same edge density as the MMP-CSN (lower left). Selected corresponding clusters are encircled. CSN statistics for these activity classes are reported in Table 1. d Exemplary compounds are shown representing the cluster encircled on the lower left of the MMP-CSN and THR-CSN in (b). Panel annotations: (a) density 0.046, ECFP4 Tc 0.61, pKi 4.00 to 10.42; (b) density 0.052, ECFP4 Tc 0.58, pKi 5.70 to 10.15; (c) density 0.025, ECFP4 Tc 0.63, pKi 3.62 to 11.00

Figure 4 reports the global distributions of DA, CC, and M values. As can be seen, both MMP-CSNs and THR-CSNs of small activity classes generally covered a wide range of parameter values. By contrast, more confined value ranges were observed for larger activity classes. Figure 4a reports the global distribution of DA values, which mostly ranged from 0.6 to 0.9. For 50 % of all MMP-CSNs and 40 % of THR-CSNs, DA values above 0.7 were obtained. The presence of well-defined compound communities of different size often gave rise to high DA values. In Fig. 4b, the distribution of CC values is reported. For larger activity classes, CC values mostly ranged from 0.5 to 0.7. CC values of at least 0.7 were detected for 16 and 19 % of all MMP-CSNs and THR-CSNs, respectively. Figure 4c shows the distribution of M values, where 72 % of all MMP-CSNs and 79 % of THR-CSNs had M values of at least 0.7. For activity classes with 500 or more compounds, high M values of greater than 0.8 were consistently observed for both MMP-CSNs and THR-CSNs. The degree of modularity is strongly dependent on edge density, as shown previously [16]. Accordingly, the presence of low edge densities, as observed for MMP-CSNs, resulted in high modularity, which was reflected by the presence of separate communities with extensive intra-community connections and clustering coefficients falling into a relatively narrow range. Since the global distributions of network parameters in Fig. 4 were rather similar, and indicative of well-defined CSN topologies, network properties were also compared in a pairwise manner for corresponding MMP-CSNs and THR-CSNs.

Comparison of network properties

Figure 5 reports the comparison of network parameters for all MMP-CSNs and corresponding THR-CSNs. A striking finding was the very low degree of correlation between DA values in Fig. 5a, with an R^2 of only 0.26. By contrast, high correlation was observed for both CC values in Fig. 5b (R^2 = 0.82) and M values in Fig. 5c (R^2 = 0.82). Since clustering coefficients and modularity directly account for topological features, their high correlation between MMP-CSNs and THR-CSNs reflected the presence of similar network structures, despite the use of distinct similarity measures, which was a surprising finding. On the other hand, DA values account for the tendency of nodes with similar

degree to be connected to each other, in accord with the homophily principle (vide supra). In MMP-CSNs and THR-CSNs, the node degree distribution was relatively narrow, due to the presence of comparable densely connected communities, rendering DAs sensitive to differences between the degrees of connected nodes. However, as shown in Fig. 4a, for larger activity classes, DA values were mostly higher for MMP-CSNs than THR-CSNs, which is also reflected in Fig. 5a. These effects were accompanied by low correlation. It follows that the homophily principle more strongly affected MMP-CSNs than corresponding THR-CSNs.

Fig. 2 Distribution of edge densities and average node degrees. (a) Edge densities of 152 MMP-CSNs (except two outliers) are plotted against the size of the corresponding activity classes (containing from 100 to 2907 compounds). In (b), the corresponding distribution of average node degrees is shown

Fig. 3 Similarity threshold values. The distribution of ECFP4 Tc threshold values for all THR-CSNs with edge density matching the corresponding MMP-CSNs is reported. Values are plotted against the size of activity classes

Degree of edge overlap and mutual information analysis

The findings revealed by comparison of network properties were further investigated by a detailed analysis of the degree of edge overlap between MMP-CSNs and corresponding THR-CSNs and mutual information (MI) analysis of network cluster distributions. Figure 6a reports the edge overlap between corresponding pairs of CSNs, which accounts for the conservation of similarity relationships. For small activity classes, edge overlap between corresponding CSNs varied significantly. For classes with more than 500 compounds, edge overlap fell mostly into the range of 0.4–0.6. Overall, 71 % of MMP-CSNs and THR-CSNs shared at least half of their edges and 34 % had at least 60 % overlap. Thus, edge overlap was limited overall, indicating that pairwise similarity relationships often differed.

Table 1 Exemplary activity classes and network characteristics

Target name                   #Cpds  MI    Edge overlap  MMP-CSN           THR-CSN
                                                         DA    CC    M     DA    CC    M
Matrix metalloproteinase 13   119    0.86  0.72          0.85  0.49  0.79  0.82  0.59  0.79
Orexin receptor 2             168    0.93  0.73          0.92  0.98  0.71  0.92  0.69  0.78
Bradykinin B1 receptor        482    0.89  0.66          0.77  0.76  0.87  0.77  0.74  0.89

For three activity classes, the CSNs of which are depicted in Fig. 1, the number of bioactive compounds and global statistical network properties are reported. MI (normalized) mutual information, DA degree assortativity, CC clustering coefficient, M modularity
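
The two pairwise measures reported in Table 1, edge overlap and normalized MI of clusterings, can be illustrated with a short sketch. It rests on assumptions that the text here does not confirm: overlap is taken as the number of shared edges divided by the edge count of the first network (for density-matched CSNs both networks have roughly the same edge count), and MI is normalized by the arithmetic mean of the two clustering entropies, which is only one of several normalizations in use. The function names and toy inputs are hypothetical.

```python
from collections import Counter
from math import log


def edge_overlap(edges_a, edges_b):
    """Fraction of edges in network A that are also present in network B."""
    a = {frozenset(edge) for edge in edges_a}  # undirected: (u, v) == (v, u)
    b = {frozenset(edge) for edge in edges_b}
    return len(a & b) / len(a)


def normalized_mutual_information(labels_a, labels_b):
    """NMI of two clusterings given as per-compound cluster labels.

    Normalization by the mean entropy is an assumption of this sketch.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    mean_entropy = (h_a + h_b) / 2
    # Two trivial one-cluster partitions are treated as identical.
    return mi / mean_entropy if mean_entropy > 0 else 1.0


# Hypothetical toy networks over compounds 1..5.
overlap = edge_overlap([(1, 2), (2, 3), (3, 4), (4, 5)],
                       [(2, 1), (3, 4), (1, 5), (2, 5)])
# Same grouping under different cluster names versus unrelated grouping.
nmi_same = normalized_mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"])
nmi_unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(overlap, nmi_same, nmi_unrelated)
```

The toy inputs show why the two measures can disagree: overlap compares individual edges, whereas NMI only asks whether compounds end up grouped together, so two networks can share few edges and still yield matching clusterings.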

Fig. 4 Global statistical network properties. Values of (a) degree assortativity, (b) clustering coefficients, and (c) modularity are reported for all MMP-CSNs (blue) and THR-CSNs (red) relative to the size of activity classes
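
The three network parameters compared in Figs. 4 and 5 can be made concrete with a small pure-Python sketch. It is illustrative only and makes several assumptions: DA is computed as the Pearson correlation of the degrees at the two ends of each edge, CC as the average of local clustering coefficients over nodes that have at least one edge (singletons are ignored), and M as Newman's modularity Q for a partition that is supplied by the caller rather than detected. Which CC variant and which community detection were used for the reported values is not stated in this section, and all function names and toy graphs below are hypothetical.

```python
from itertools import combinations


def degree_assortativity(edges):
    """Pearson correlation of end-point degrees, each edge counted both ways."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    mean = sum(xs) / len(xs)  # xs and ys share the same mean by symmetry
    variance = sum((x - mean) ** 2 for x in xs)
    if variance == 0:
        return float("nan")  # undefined when all end-point degrees are equal
    return sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / variance


def average_clustering(edges):
    """Mean local clustering coefficient over nodes with at least one edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def modularity(edges, community):
    """Newman modularity Q of a given node-to-community assignment."""
    m = len(edges)
    intra, degree_sum = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Hypothetical toy graphs.
star = [(0, 1), (0, 2), (0, 3)]                      # hub with three leaves
triangle_tail = [(0, 1), (1, 2), (0, 2), (2, 3)]     # triangle plus pendant
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

da = degree_assortativity(star)
cc = average_clustering(triangle_tail)
q = modularity(two_triangles, partition)
print(da, cc, q)
```

The star graph gives the most disassortative case (a hub linked only to low-degree nodes), which is the opposite of the homophily pattern discussed in the text.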

Fig. 5 Comparison of network properties. Scatter plots, regression lines, and coefficients of determination (R²) are reported for the comparison of (a) degree assortativity, (b) clustering coefficient, and (c) modularity of corresponding MMP-CSNs and THR-CSNs

Normalized MI values for pairs of exemplary MMP-CSNs and THR-CSNs depicted in Fig. 1 are given in Table 1, and Fig. 6b reports the global distribution of MI values. The values were calculated as a measure of the correspondence of clusterings in MMP-CSNs and THR-CSNs. In this case, a different picture emerged. Regardless of the size of activity classes, MI values were generally high. For ~81 % of all corresponding pairs of CSNs, MI values greater than 0.8 were obtained, reflecting the presence of very similar cluster distributions, despite the presence of limited edge overlap. In other words,
although pairwise similarity relationships in MMP-CSNs and related THR-CSNs frequently differed, their topologies and community structures were comparable.

Fig. 6 Edge overlap and mutual information analysis. (a) Reports the fraction of overlapping edges and (b) the normalized mutual information of clusterings in corresponding MMP-CSNs and THR-CSNs. Values are plotted against the size of activity classes

Fig. 7 Connected nodes. The proportion of connected nodes in corresponding MMP-CSNs and THR-CSNs is compared for all activity classes (plot annotation: R² = 0.75)

Fig. 8 Comparison of activity cliff distributions. The fraction of activity cliffs in MMP-CSNs that are conserved in THR-CSNs is reported for sets of cliffs with at least (a) 10-fold, (b) 100-fold, and (c) 1000-fold difference in potency between cliff-forming compounds

Connected nodes versus singletons

In MMP-CSNs, the percentage of connected nodes, i.e., compounds with structural relationships, ranged from 38.7 to 99.1 % across all data sets. In THR-CSNs, 40.5–99.6 % of the nodes were connected to others. Figure 7 shows that in most CSNs, the proportion of connected nodes was large (or very large) and thus the number of singletons (i.e., nodes/compounds with no structural relationships) was small. Figure 7 also shows that there was a notable
correlation between the proportion of connected nodes in corresponding MMP-CSNs and THR-CSNs. We note that lead optimization data sets from medicinal chemistry typically contain many well-defined structural relationships, due to the presence of analog series, and their CSNs thus contain a very large proportion of connected nodes and only a small number of singletons.

Comparison of activity cliff distributions

The presence of large proportions of varying similarity relationships in MMP-CSNs and THR-CSNs suggested a comparison of their SAR information content. For CSNs charting biologically relevant chemical space, SAR analysis is a premier application. Accordingly, in light of the detected relationship variations discussed above, it was highly relevant to compare the SAR information contained in alternative CSN representations. As an indicator of SAR information, activity cliff distributions were compared, which represent a primary focal point of SAR analysis [18]. MMP cliffs [19] spanning at least 10-, 100-, and 1000-fold differences in potency were detected in all, 151, and 122 activity classes, respectively. Thus, activity cliff distributions at varying potency difference levels could be compared in corresponding CSNs essentially across all activity classes. Figure 8 reports the ratio of conserved activity cliffs in different categories across all qualifying activity classes. Consistent with the observations that similarity relationships frequently differed in MMP-CSNs and THR-CSNs, despite similar topologies and community structures, activity cliff conservation was generally limited. With increasing potency differences between cliff-forming compounds, variations in cliff overlap notably increased. In many instances, there was less than 50 % overlap between activity cliffs with at least a 100- or 1000-fold difference in potency in MMP-CSNs and corresponding THR-CSNs, thus indicating the presence of substantial differences in SAR information. Figure 9 depicts an activity cliff network for an exemplary activity class of bradykinin B1 receptor antagonists. It directly compares the formation of activity cliffs with at least 100-fold difference in potency in the MMP-CSN and corresponding THR-CSN and clearly illustrates a substantial difference in the distribution of activity cliffs and their limited conservation. Thus, interpretation of activity cliff patterns in the MMP-CSN and THR-CSN would lead to different results, a direct consequence of applying alternative similarity measures.

Fig. 9 Activity cliff network. Shown is a network of 248 bradykinin B1 receptor antagonists that form activity cliffs with at least 100-fold potency difference in the MMP-CSN and/or THR-CSN. Blue, red, and gray edges indicate activity cliffs found only in the MMP-CSN, only in the THR-CSN, and in both CSNs, respectively. MMP cliffs and activity cliffs were determined on the basis of transformation size-restricted MMPs and Tanimoto similarity, respectively. Nodes are colored according to potency using the continuous potency spectrum (with pKi values) shown on the lower right. Legend: 248 compounds; 139 activity cliffs (AC) in the MMP-CSN only, 104 in the THR-CSN only, and 153 in both CSNs; pKi scale 3.62 to 10.64
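
The cliff conservation ratio plotted in Fig. 8 can be sketched as follows, assuming that an activity cliff is simply an edge whose two compounds differ in pKi by at least a given amount (1, 2, or 3 log units for 10-, 100-, and 1000-fold) and that a cliff is conserved when the same compound pair is also an edge in the other network. The function names, compound labels, and edge lists are hypothetical; the pKi values are illustrative numbers and not data from the study.

```python
def activity_cliffs(edges, pki, min_delta):
    """Edges whose end points differ in pKi by at least min_delta log units."""
    return {
        frozenset((u, v))
        for u, v in edges
        if abs(pki[u] - pki[v]) >= min_delta
    }


def cliff_conservation(mmp_edges, thr_edges, pki, min_delta):
    """Fraction of MMP-CSN cliffs that are also cliffs in the THR-CSN.

    Returns None when the MMP-CSN contains no cliff at this level, so that
    such classes can be skipped rather than reported as zero.
    """
    mmp_cliffs = activity_cliffs(mmp_edges, pki, min_delta)
    thr_cliffs = activity_cliffs(thr_edges, pki, min_delta)
    if not mmp_cliffs:
        return None
    return len(mmp_cliffs & thr_cliffs) / len(mmp_cliffs)


# Hypothetical toy data: four compounds and two alternative edge sets.
toy_pki = {"a": 9.6, "b": 7.5, "c": 5.9, "d": 7.7}
toy_mmp_edges = [("a", "b"), ("b", "c"), ("a", "d")]
toy_thr_edges = [("a", "b"), ("c", "d"), ("a", "c")]

ratio_10_fold = cliff_conservation(toy_mmp_edges, toy_thr_edges, toy_pki, 1.0)
ratio_100_fold = cliff_conservation(toy_mmp_edges, toy_thr_edges, toy_pki, 2.0)
ratio_1000_fold = cliff_conservation(toy_mmp_edges, toy_thr_edges, toy_pki, 3.0)
print(ratio_10_fold, ratio_100_fold, ratio_1000_fold)
```

Returning None for classes without qualifying cliffs mirrors the observation that 100- and 1000-fold cliffs were not detected in every activity class.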

Concluding remarks

Second generation CSNs based upon a substructure-based similarity criterion were compared to their corresponding THR-CSNs. Networks were generated for a total of 154 activity classes of different size and composition for which high-confidence activity data were available. These activity classes covered a wide range of human targets. MMP-CSNs, which are structurally more intuitive than threshold-based CSNs, were found to represent an attractive alternative to the latter, yielding low edge densities and well-resolved topologies with defined community structures. The large-scale nature of our study made it possible to carry out systematic comparisons of network properties. In addition, the overlap in similarity relationships established using alternative similarity measures (substructure-based versus Tanimoto similarity) was determined and the structure of clusterings in MMP-CSNs and corresponding THR-CSNs was compared via mutual information analysis. These comparisons revealed that MMP-CSNs and THR-CSNs had surprisingly similar topologies and community structures, although there was limited overlap in similarity relationships. Only degree assortativity emerged as a major distinguishing global network property, with very low correlation (R² = 0.26) between MMP-CSNs and corresponding THR-CSNs. These observations pointed at a substantially different influence of the homophily principle on MMP-CSNs versus THR-CSNs, despite the presence of overall comparable topological features. Accordingly, such differences were manifested in local CSN environments, especially separate compound communities. Although many corresponding community structures were identified in MMP-CSNs and THR-CSNs, similarity relationships frequently differed within communities, which also led to significant differences in the distribution of activity cliffs in MMP-CSNs versus THR-CSNs.

Activity cliffs at varying potency difference levels were assessed as an indicator of SAR information content of alternative CSN representations. Limited conservation of activity cliffs correlated with differences in interpretable SAR information. Therefore, to study local SAR environments and focus on SAR-informative compound subsets and candidates for further chemical exploration, the use of MMP-CSNs and THR-CSNs is complementary. However, given that MMP-CSNs and THR-CSNs at constant edge density had comparable topological features and community structures, they provided similar global views of bioactive chemical space and could thus be equally employed for global data set analysis. By contrast, for SAR analysis, major differences resided within community structures where varying similarity relationships encoded SARs in different ways.

In conclusion, if CSNs are specifically generated for local SAR analysis, both MMP-CSNs and THR-CSNs should be considered. To select compound subsets and guide analog design, MMP-CSNs might best be used, given the ease of interpretation of MMP-based substructure relationships. However, for more global views of biologically relevant chemical space or comparison of compound classes with different activities, either MMP-CSNs or THR-CSNs might be employed.

Acknowledgments The authors thank Ye Hu for help with data set collection and Dilyana Dimova for MMP routines. BZ is supported by the China Scholarship Council.

References

1. Dobson CM (2004) Chemical space and biology. Nature 432:824–828
2. Bohacek RS, McMartin C, Guida WC (1996) The art and practice of structure-based drug design: a molecular modelling perspective. Med Res Rev 16:3–50
3. Pearlman R, Smith K (2002) Novel software tools for chemical diversity. 3D QSAR in drug design: three-dimensional. Quant Struct Act Relat 2:339–353
4. Maggiora GM, Bajorath J (2014) Chemical space networks: a powerful new paradigm for the description of chemical space. J Comput Aided Mol Des 28:795–802
5. Maggiora GM, Vogt M, Stumpfe D, Bajorath J (2014) Molecular similarity in medicinal chemistry. J Med Chem 57:3186–3204
6. Watts D, Strogatz S (1998) Collective dynamics of 'small-world' networks. Nature 393:440–442
7. Barabási A, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512
8. Newman M (2010) Networks: an introduction. Oxford University Press Inc., New York
9. Newman M (2003) The structure and function of complex networks. SIAM Rev 45:167–256
10. Albert R, Barabási A (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97
11. McPherson M, Smith-Lovin L, Cook J (2001) Birds of a feather: homophily in social networks. Annu Rev Sociol 27:415–444
12. Wawer M, Peltason L, Weskamp N, Teckentrup A, Bajorath J (2008) Structure-activity relationship anatomy by network-like similarity graphs and local structure-activity relationship indices. J Med Chem 51:6075–6084
13. Tanaka N, Ohno K, Niimi T, Moritomo A, Mori K, Orita M (2009) Small-world phenomena in chemical library networks: application to fragment-based drug discovery. J Chem Inf Model 49:2677–2686
14. Krein MP, Sukumar N (2011) Exploration of the topology of chemical spaces with network measures. J Phys Chem A 115:12905–12918
15. Fourches D, Tropsha A (2013) Using graph indices for the analysis and comparison of chemical data sets. Mol Inf 32:827–842
16. Zwierzyna M, Vogt M, Maggiora GM, Bajorath J (2015) Design and characterization of chemical space networks for different compound data sets. J Comput Aided Mol Des 29:113–125
17. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742–754
18. Stumpfe D, Hu Y, Dimova D, Bajorath J (2014) Recent progress in understanding activity cliffs and their utility in medicinal chemistry. J Med Chem 57:18–28
19. Hu X, Hu Y, Vogt M, Stumpfe D, Bajorath J (2012) MMP-cliffs: systematic identification of activity cliffs on the basis of matched molecular pairs. J Chem Inf Model 52:1138–1145
20. Stumpfe D, Bajorath J (2012) Frequency of occurrence and potency range distribution of activity cliffs in bioactive compounds. J Chem Inf Model 52:2348–2353
21. Kenny PW, Sadowski J (2005) Structure modification in chemical databases. In: Oprea TI (ed) Chemoinformatics in drug discovery. Wiley-VCH, Weinheim, pp 271–285
22. Hussain J, Rea C (2010) Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J Chem Inf Model 50:339–348
23. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(Database issue):D1100–D1107
24. Java Universal Network/Graph Framework. http://jung.sourceforge.net. Accessed 12 Oct 2014
25. Fruchterman TMJ, Reingold EM (1991) Graph drawing by force-directed placement. Softw Pract Exp 21:1129–1164
26. Newman M, Park J (2003) Why social networks are different from other types of networks. Phys Rev E 68:036122
27. Foster D, Foster J, Grassberger P, Paczuski M (2011) Clustering drives assortativity and community structure in ensembles of networks. Phys Rev E 84:066117
28. Newman M (2004) Fast algorithm for detecting community structure in networks. Phys Rev E 69:066133
29. Maggiora GM, Shanmugasundaram V (2005) An information-theoretic characterization of partitioned property spaces. J Math Chem 38:1–20
30. Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854
