DOI: http://dx.doi.org/10.1016/j.artmed.2014.01.003

To appear in: Artificial Intelligence in Medicine

Received date: 8-2-2013
Revised date: 15-1-2014
Accepted date: 28-1-2014

Please cite this article as: Xulei Yang, Aize Cao, Qing Song, Gerald Schaefer, Yi Su, Vicinal support vector classifier using supervised kernel-based clustering, Artificial Intelligence in Medicine (2014), http://dx.doi.org/10.1016/j.artmed.2014.01.003

Vicinal support vector classifier using supervised kernel-based clustering

Xulei Yang a,c,∗, Aize Cao b, Qing Song c, Gerald Schaefer d, Yi Su a

a Department of Computing Science, Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, Singapore 138632
b The Institute for Medicine and Public Health, Vanderbilt University, 2525 West End Ave., Suite 600, Nashville, TN 37203-1738, USA
c School of Electrical and Electronics Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
d Department of Computer Science, Loughborough University, Leicestershire, LE11 3TU, UK

Abstract


Objective: Support vector machines (SVMs) have drawn considerable attention due to their high generalisation ability and superior classification performance compared to other pattern recognition algorithms. However, the assumption that the learning data is identically generated from unknown probability distributions may limit the application of SVMs to real problems. In this paper, we propose a vicinal support vector classifier (VSVC) which is shown to be able to effectively handle practical applications where the learning data may originate from different probability distributions. Methods: The proposed VSVC method utilises a set of new vicinal kernel functions which are constructed based on supervised clustering in the kernel-induced feature space. Our proposed approach comprises two steps. In the clustering step, a supervised kernel-based deterministic annealing (SKDA) clustering algorithm is employed to partition the training data into different soft vicinal areas of the feature space in order to construct the vicinal kernel functions. In the training step, the SVM technique is used to minimise the vicinal risk function under the constraints of the vicinal areas defined in the SKDA clustering step. Results: Experimental results on both artificial and real medical datasets show that our proposed VSVC achieves better classification accuracy and lower computational time compared to a standard SVM. For an artificial dataset constructed from non-separated data, the classification accuracy of VSVC is between 95.5% and 96.25% (using different cluster numbers), which compares favourably to the 94.5% achieved by SVM. The VSVC training time is between 8.75s and 17.83s (for 2 to 8 clusters), considerably less than the 65.0s required by SVM. On a real mammography dataset, the best classification accuracy of VSVC is 85.7% and thus clearly outperforms a standard SVM which obtains an accuracy of only 82.1%.
A similar performance improvement is confirmed on two further real datasets, a breast cancer dataset (74.01% vs. 72.52%) and a heart dataset (84.77% vs. 83.81%), coupled with a reduction in terms of learning time (32.07s vs. 92.08s and 25.00s vs. 53.31s, respectively). Furthermore, the VSVC results in the number of support vectors being equal to the specified cluster number, and hence in a much sparser solution compared to a standard SVM.


Conclusion: Incorporating a supervised clustering algorithm into the SVM technique leads to a sparse but effective solution, while making the proposed VSVC adaptive to different probability distributions of the training data.


Keywords: support vector machines, kernel-based data clustering, supervised deterministic annealing, mammographic mass classification, biomedical data classification


1. Introduction


Support vector machines (SVMs), first introduced for pattern classification and regression problems by Vapnik and colleagues [1, 2], can be seen as a new training technique for traditional polynomial, radial basis function (RBF) or multi-layer perceptron classifiers by defining relevant kernel functions [3]. SVMs have drawn considerable attention due to their high generalisation ability for a wide range of applications and typically better performance compared to other traditional learning machines [4–6]. Rooted in statistical learning theory [7, 8], SVMs are based on a sound theoretical justification in terms of generalisation, convergence, approximation etc., yet require the assumption that all data points in the training set are independent and identically distributed (i.i.d.) according to some probability distribution(s). However, in many practical applications, the obtained training data is subject to different probability distributions with respect to different vicinities/clusters. This limits the application of the standard SVM approach to real-world problems. The assumption that the training data are i.i.d. can be relaxed, and some research explores more general conditions under which successful learning can take place [9]. In particular, relaxations of the independence assumption have been considered in both the machine learning and the statistical learning literature, e.g. in [10–12], where a weaker notion of mixing replaces the notion of independence and it is shown that most of the main results of statistical learning theory continue to hold under this weaker hypothesis. In contrast, relaxations of the identically distributed assumption are less common, and few attempts have been made to address this limitation, with the exception of the so-called vicinal SVM [8, 13].
The vicinal SVM was originally proposed according to the vicinal risk minimisation (VRM) principle, which introduces a new class of support vector machines by defining appropriate vicinal kernel functions. Its main motivation is to address different probability density functions with respect to different vicinities of the training dataset. A hyperplane is derived by maximising the margin between two classes under the vicinal risk function. The key issue in implementing the vicinal SVM is the construction of vicinal kernel functions for the given training data. Vapnik suggested an input space partitioning scheme [8] in order to define the vicinity for each training point by using Laplacian-type and Gaussian-type kernels. However, for the general non-linear case like kernel-based SVMs, applying the VRM principle with input space partitioning is not straightforward due to the non-linear mapping between input data and feature space. In this paper, we extend the work of [14] and present an effective method to construct new vicinal kernel functions for SVM learning. These are derived based on supervised clustering in the kernel-induced feature space. Our proposed vicinal support vector classifier (VSVC) is suitable for practical applications where the learning data may come from different probability distributions. VSVC proceeds in two phases. In the clustering phase, a supervised kernel-based deterministic annealing (SKDA) clustering algorithm is used to partition the training data into different soft vicinal areas of the feature space in order to construct the vicinal kernel functions. In the training phase, an SVM is constructed so as to minimise the vicinal risk function under the constraints of the respective vicinal areas defined in the clustering phase. Incorporating the supervised clustering technique into SVM learning leads to a sparse solution, while making the proposed VSVC adaptive to different probability distributions of the training data. Experimental results on both artificial and real datasets confirm that our proposed method yields higher classification accuracy and faster training compared to the standard SVM approach.

The remainder of the paper is organised as follows. A brief review of the VRM principle and the vicinal SVM with input space partitioning is given in Section 2. Section 3 then presents our proposed VSVC based on supervised feature space partitioning. Experimental results are reported in Section 4, while Section 5 concludes the paper.

∗ Corresponding author. Tel.: +65 64191498. Email: [email protected]

Preprint submitted to Artificial Intelligence in Medicine, January 15, 2014

2. Vicinal SVM with input space partitioning

Let us consider the input-output training data pairs

  {(x_i, y_i)}_{i=1}^l,  x_i ∈ R^n,  y_i ∈ {−1, 1},   (1)

where l is the number of input data points, and n is the dimension of the input space. In statistical learning theory [8], these training data pairs are normally assumed to be i.i.d. (independent and identically distributed) according to an unknown probability distribution p(x, y). However, in practical applications it is possible that the probability distribution is multi-modal with different vicinities of the training data pairs. To relax the identically distributed assumption, the vicinal support vector machine was proposed based on input space partitioning according to the vicinal risk minimisation (VRM) principle [8, 13], which aims to obtain solutions in the form of so-called vicinal kernels corresponding to different vicinities of the training data. The vicinity functions v(x_i) of data points x_i are constructed if the training data points satisfy two assumptions:

1. The unknown density function is smooth in the vicinity of any point x_i.
2. The function minimising the risk functional is also smooth and symmetric in the vicinity of any point x_i.


An optimisation problem based on the VRM principle, named linear vicinal SVM [8, 13], can then be formulated as

  minimise   Φ(w) = (1/2) wᵀw + C Σ_{i=1}^l ξ_i
  subject to y_i ∫_{v(x_i)} [⟨x, w⟩ + b] p(x|v(x_i)) dx ≥ 1 − ξ_i,  i = 1, …, l
             ξ_i ≥ 0,  i = 1, …, l,   (2)

where w is the weight vector, C is a penalty constant for the slack variables ξ_i, b is the offset, v(x_i) is the vicinity associated with training point x_i, and p(x|v(x_i)) is the conditional probability of the respective vicinity in the input space. It can be shown that this optimisation problem is equivalent to the standard SVM solution if the training vector x_i is the centroid of the respective vicinity. That is, if

  x_i = ∫_{v(x_i)} x p(x|v(x_i)) dx,   (3)

then the vicinal SVM solution coincides with the standard SVM solution

  y_i ∫_{v(x_i)} [⟨x, w⟩ + b] p(x|v(x_i)) dx = y_i [⟨x_i, w⟩ + b].   (4)

The following theorem for the vicinal SVM solution holds (see [8] for a proof):

Theorem 1. The vicinal SVM solution is formulated by

  f(x) = Σ_{i=1}^l y_i β_i L(x, x_i) + b,   (5)

where to define the coefficients β_i one has to

  maximise   W(β) = Σ_{i=1}^l β_i − (1/2) Σ_{i,j=1}^l β_i β_j y_i y_j M(x_i, x_j)
  subject to Σ_{i=1}^l β_i y_i = 0
             β_i ≥ 0,   (6)

where L(x, x_i) is called the one-vicinal kernel and M(x_i, x_j) the two-vicinal kernel of the vicinal SVM [8].

In [8], Vapnik suggested an input space partitioning scheme to generate the vicinity for each training point using Laplacian-type and Gaussian-type kernels. However, this approach may not work for the general non-linear case such as kernel-based SVMs, where the mapping between input training patterns and feature space is non-linear. In the next section, we will present a feature space partitioning scheme based on supervised clustering in the kernel-induced feature space to effectively construct vicinal kernel functions.


3. The proposed approach

The basic idea of our proposed VSVC is to construct new vicinal kernel functions, derived through supervised clustering in the feature space. These vicinal kernel functions are then used for SVM learning.

3.1. Supervised clustering in feature space

Clustering of training data in the feature space is a well researched area [15–18]. By non-linearly mapping the observed data from a low-dimensional input space to a high-dimensional feature space through a kernel function, which makes linear separation of the data easier, kernel-based clustering algorithms can reveal structure in the data that may not be apparent to traditional clustering algorithms in the input space. Let us consider the mapping of the input space X ∈ R^n to a potentially much higher-dimensional feature space F through a non-linear mapping function

  φ: R^n → F,  x_j → φ(x_j),  j = 1, 2, ..., l,   (7)

where F could have an arbitrarily large, possibly infinite, dimensionality, and φ(x_j) is the transformed point of x_j. All l training data points are partitioned into c vicinities/clusters in the feature space, where the labelled mass center φ_k of the k-th vicinity resides in F. This has a representation similar to k-means clustering in feature space as in [19]

  φ_k = Σ_{i=1}^l α_ki z_i,   k = 1, 2, ..., c,   (8)

where c denotes the number of clusters, α_ki are the parameters to be determined by the clustering technique, and z_i = y_i φ(x_i) denotes the labelled data points in the feature space. Note that unlike k-means clustering in feature space, the proposed clustering technique is a soft clustering approach. That is, training data can be related to all c soft clusters/vicinities. Furthermore, the label information y_i of data point x_i is used in the clustering procedure, which can prevent training data points with different labels from being clustered into the same cluster/vicinity¹. Following kernel-based algorithms, which employ the Mercer kernel representation of the dot-product in the reproducing kernel Hilbert space, the square distance between the labelled center φ_k and the labelled training vector z_i in feature space can be defined

¹ The incorporation of label information will shrink the distances between data points with identical labels, while enlarging the distances between data points with different labels, so that data points with different labels will not be clustered together.


in the implicit form of kernel functions as

  D_k(z_i) = ‖z_i − φ_k‖²
           = ‖y_i φ(x_i) − Σ_{m=1}^l α_km y_m φ(x_m)‖²
           = y_i² ⟨φ(x_i), φ(x_i)⟩ − 2 y_i Σ_{m=1}^l α_km y_m ⟨φ(x_i), φ(x_m)⟩ + Σ_{m,n=1}^l α_km α_kn y_m y_n ⟨φ(x_m), φ(x_n)⟩
           = K(x_i, x_i) − 2 y_i Σ_{m=1}^l α_km y_m K(x_i, x_m) + Σ_{m,n=1}^l α_km α_kn y_m y_n K(x_m, x_n).   (9)

The kernel trick K(x, y) = ⟨φ(x), φ(y)⟩ used above allows the dot-product in the feature space to be calculated implicitly, via the kernel function evaluated in the input space, without computing the mapping φ directly. Similar to the notation used in [20], we let p(φ_k|z_i) denote the association probability relating the mapped point z_i with cluster center φ_k. Using the kernel-induced distance measure from Eq. (9), the distortion function in the feature space becomes

  J_φ = Σ_{i=1}^l Σ_{k=1}^c p(z_i) p(φ_k|z_i) D_k(z_i).   (10)

As no a priori knowledge of the data distribution is assumed, out of all possible distributions that yield a given value of J_φ we choose the one that maximises the conditional Shannon entropy in the feature space

  H_φ = − Σ_{i=1}^l Σ_{k=1}^c p(z_i) p(φ_k|z_i) log p(φ_k|z_i).   (11)

The supervised kernel-based deterministic annealing clustering algorithm² can then be formulated as minimising

  F_φ = J_φ − T H_φ.   (12)

It turns out [20] that, according to the maximum entropy principle, the resultant distribution is the tilted distribution given by

  p(φ_k|z_i) = p(φ_k) e^{−D_k(z_i)/T} / Σ_{m=1}^c p(φ_m) e^{−D_m(z_i)/T},   (13)

² The deterministic annealing (DA) algorithm was originally proposed in [20, 21] for clustering, compression, regression and related optimisation problems, while an unsupervised kernel-based version of DA was presented in [22].


where p(φ_k) is the mass probability of the k-th cluster in the feature space

  p(φ_k) = Σ_{i=1}^l p(z_i) p(φ_k|z_i).   (14)

Substituting Eq. (13) into Eq. (12), we obtain the free energy function in the feature space

  F_φ* = min_{p(φ_k|z_i)} (J_φ − T H_φ) = −T Σ_{i=1}^l p(z_i) log Σ_{k=1}^c p(φ_k) e^{−D_k(z_i)/T}.   (15)

Minimising F_φ* with respect to φ_k then yields

  ∂F_φ*/∂φ_k = 0,   (16)

and hence

  Σ_{i=1}^l p(z_i) p(φ_k) e^{−D_k(z_i)/T} [z_i − φ_k] = 0.   (17)

Dividing this by the normalisation factor

  Z_{z_i} = Σ_{m=1}^c p(φ_m) e^{−D_m(z_i)/T}   (18)

leads to

  Σ_{i=1}^l [p(z_i) p(φ_k) e^{−D_k(z_i)/T} / Z_{z_i}] z_i = Σ_{i=1}^l [p(z_i) p(φ_k) e^{−D_k(z_i)/T} / Z_{z_i}] φ_k.   (19)

Using Eq. (13), this can be rewritten as

  Σ_{i=1}^l p(z_i) p(φ_k|z_i) z_i = Σ_{i=1}^l p(z_i) p(φ_k|z_i) φ_k,   (20)

leading to

  φ_k = Σ_{i=1}^l [p(z_i) p(φ_k|z_i) / Σ_{j=1}^l p(z_j) p(φ_k|z_j)] z_i = Σ_{i=1}^l α_ki z_i.   (21)

Finally, from Eq. (21), we obtain the explicit expression of α_ki as

  α_ki = p(z_i) p(φ_k|z_i) / Σ_{j=1}^l p(z_j) p(φ_k|z_j),   (22)

which will be used to construct the vicinal kernel functions for our proposed VSVC in the next subsection.
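The SKDA update loop implied by Eqs. (9), (13), (14) and (22) can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation; the random initialisation, the uniform data prior p(z_i) = 1/l, and the geometric cooling schedule are our own assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Gaussian RBF kernel matrix, K[i, j] = exp(-||X_i - Y_j||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def skda(K, y, c=3, T0=1.0, cooling=0.9, n_iter=50, seed=0):
    """Supervised kernel-based deterministic annealing (sketch).

    K : (l, l) kernel matrix, y : (l,) labels in {-1, +1}.
    Returns alpha with shape (c, l); row k holds the weights alpha_ki
    of Eq. (22), so each row sums to one.
    """
    l = K.shape[0]
    rng = np.random.default_rng(seed)
    Kz = np.outer(y, y) * K                     # <z_i, z_j> for z_i = y_i phi(x_i)
    alpha = rng.dirichlet(np.ones(l), size=c)   # random soft initialisation
    p_z = np.full(l, 1.0 / l)                   # assumed uniform data prior
    p_phi = np.full(c, 1.0 / c)
    T = T0
    for _ in range(n_iter):
        # squared feature-space distances D_k(z_i), Eq. (9)
        D = (np.diag(Kz)[None, :]
             - 2.0 * alpha @ Kz
             + np.einsum('ki,kj,ij->k', alpha, alpha, Kz)[:, None])
        # tilted association probabilities p(phi_k | z_i), Eq. (13)
        logits = np.log(p_phi + 1e-12)[:, None] - D / T
        logits -= logits.max(axis=0, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=0, keepdims=True)
        p_phi = P @ p_z                          # cluster mass, Eq. (14)
        alpha = (P * p_z) / ((P @ p_z)[:, None] + 1e-12)   # Eq. (22)
        T *= cooling                             # annealing schedule
    return alpha
```

Lowering T sharpens the soft memberships, so early iterations explore while later ones approach a hard partition.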


3.2. VSVC with feature space partitioning

Different from standard unsupervised deterministic annealing [20, 21], the cluster center in Eq. (21) is indeed the mass center φ_k of the k-th vicinity in feature space, which does not necessarily have a pre-image in the input space [17]. It can be considered as virtual training data and consists of the labelled data images as

  φ_k = Σ_{i=1}^l α_ki z_i,   k = 1, 2, ..., K,   (23)

with a probability framework in feature space as described in Section 3.1. In contrast to partitioning in the input space, we consider partitioning in the feature space since an SVM can be generalised as a linear classifier that operates natively in feature space. Consequently, we propose an optimisation problem similar to [8] but based on feature space partitioning according to the VRM principle, that is,

  minimise   Φ(w) = (1/2) wᵀw + C Σ_{k=1}^K ξ_k
  subject to y_k ∫_{v(φ_k)} [⟨z, w⟩ + b] p(z|φ_k) dz ≥ 1 − ξ_k,  k = 1, …, K
             ξ_k ≥ 0,  k = 1, …, K,   (24)

where v(φ_k) represents the k-th vicinity associated with mass center φ_k in the feature space, and p(z|φ_k) is the conditional probability of the respective vicinity in the feature space. The constraint of the VSVC in feature space can be presented as

  y_k ∫_{v(φ_k)} [⟨z, w⟩ + b] p(z|φ_k) dz = y_k [⟨φ_k, w⟩ + b] ≥ 1 − ξ_k,   (25)

which we prove in the following.

Proof. According to Bayes' theorem we have

  p(z_i|φ_k) = p(z_i) p(φ_k|z_i) / p(φ_k) = p(z_i) p(φ_k|z_i) / Σ_{j=1}^l p(z_j) p(φ_k|z_j).   (26)

Comparing Eqs. (21) and (26), we obtain

  φ_k = Σ_{i=1}^l p(z_i|φ_k) z_i,   (27)

so that

  y_k ∫_{v(φ_k)} [⟨z, w⟩ + b] p(z|φ_k) dz
    = y_k [⟨∫_{v(φ_k)} p(z|φ_k) z dz, w⟩ + ∫_{v(φ_k)} b p(z|φ_k) dz]
    = y_k [⟨Σ_{i=1}^l p(z_i|φ_k) z_i, w⟩ + Σ_{i=1}^l b p(z_i|φ_k)]
    = y_k [⟨φ_k, w⟩ + b],   (28)

which completes the proof.


Let us define the one-vicinal kernel as

  L_k(x) = Σ_{i=1}^l y_i α_ki K(x, x_i),   k = 1, 2, ..., K,   (29)

and the two-vicinal kernel as

  M_km = Σ_{i=1}^l Σ_{j=1}^l y_i y_j α_ki α_mj K(x_i, x_j),   k, m = 1, 2, ..., K,   (30)

where the parameters α_ki are obtained from the SKDA clustering phase as discussed in Section 3.1.
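Given the base kernel matrix and the SKDA weights, the one- and two-vicinal kernels of Eqs. (29) and (30) reduce to two matrix products. A minimal sketch; the array layout (clusters along rows) is our own convention:

```python
import numpy as np

def vicinal_kernels(K, y, alpha):
    """One- and two-vicinal kernels, Eqs. (29)-(30).

    K     : (l, l) base kernel matrix on the training set
    y     : (l,) labels in {-1, +1}
    alpha : (c, l) cluster weights from the SKDA step

    Returns L of shape (c, l), with L[k, j] = L_k(x_j),
    and M of shape (c, c), with M[k, m] = M_km.
    """
    A = alpha * y[None, :]   # row k holds y_i * alpha_ki
    L = A @ K                # L_k(x_j) = sum_i y_i alpha_ki K(x_j, x_i)
    M = A @ K @ A.T          # M_km = sum_ij y_i y_j alpha_ki alpha_mj K(x_i, x_j)
    return L, M
```

Since K is symmetric, M is symmetric as well, which the dual optimisation in the next theorem relies on.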

Theorem 2. The proposed vicinal support vector classifier with feature space partitioning has the solution

  f(x) = Σ_{k=1}^c β_k y_k L_k(x) + b,   (31)

where β_k is the coefficient that maximises the dual functional

  maximise   W(β) = Σ_{k=1}^c β_k − (1/2) Σ_{k,m=1}^c β_k β_m y_k y_m M_km
  subject to Σ_{k=1}^c y_k β_k = 0
             β_k ≥ 0.   (32)

Proof. We construct the Lagrangian function

  J(w, b, β) = (1/2) wᵀw − Σ_{k=1}^c β_k (y_k [⟨φ_k, w⟩ + b] − 1 + ξ_k) − Σ_{k=1}^c η_k ξ_k,   (33)

where the β_k (and η_k) are the Lagrangian multipliers. The parameters that minimise the Lagrangian must satisfy the conditions

  ∂J(w, b, β)/∂w = w − Σ_{k=1}^c β_k y_k φ_k = 0,   (34)

  ∂J(w, b, β)/∂b = − Σ_{k=1}^c β_k y_k = 0.   (35)

From these, we have

  w = Σ_{k=1}^c β_k y_k φ_k = Σ_{k=1}^c β_k y_k Σ_{i=1}^l α_ki z_i,   (36)

and

  Σ_{k=1}^c β_k y_k = 0.   (37)

We substitute Eq. (36) into the non-linear indicator function f(x) = ⟨φ(x), w⟩ + b and obtain

  f(x) = ⟨φ(x), Σ_{k=1}^c β_k y_k Σ_{i=1}^l α_ki z_i⟩ + b
       = Σ_{k=1}^c β_k y_k Σ_{i=1}^l α_ki ⟨φ(x), y_i φ(x_i)⟩ + b
       = Σ_{k=1}^c β_k y_k Σ_{i=1}^l α_ki y_i K(x, x_i) + b
       = Σ_{k=1}^c β_k y_k L_k(x) + b.   (38)

The dual problem for data classification can now be formulated as

  maximise   W(β) = Σ_{k=1}^c β_k − (1/2) Σ_{k,m=1}^c β_k β_m y_k y_m ⟨φ_k, φ_m⟩
                  = Σ_{k=1}^c β_k − (1/2) Σ_{k,m=1}^c β_k β_m y_k y_m Σ_{i,j=1}^l α_ki α_mj ⟨y_i φ(x_i), y_j φ(x_j)⟩
                  = Σ_{k=1}^c β_k − (1/2) Σ_{k,m=1}^c β_k β_m y_k y_m Σ_{i,j=1}^l α_ki α_mj y_i y_j K(x_i, x_j)
                  = Σ_{k=1}^c β_k − (1/2) Σ_{k,m=1}^c β_k β_m y_k y_m M_km
  subject to Σ_{k=1}^c β_k y_k = 0
             β_k ≥ 0.   (39)

This completes the proof.
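Since the dual has only c variables, it is a small quadratic programme that an off-the-shelf solver handles directly. The sketch below is one possible reading, not the authors' implementation: each vicinity is treated as a single pseudo-point, its feature-space centroid c_k = Σ_i α_ki φ(x_i) carrying an assumed cluster label y_k (taken here from the dominant class of the cluster), which reduces the problem to a standard SVM dual over c centroids. The box constraint C and the offset recovery are likewise our assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def solve_cluster_dual(K, alpha, y_c, C=10.0):
    """Solve a c-variable SVM dual over the cluster mass centres (sketch).

    K     : (l, l) base kernel matrix on the training set
    alpha : (c, l) SKDA weights; row k defines the centroid c_k
    y_c   : (c,) assumed cluster labels in {-1, +1}
    """
    c = alpha.shape[0]
    G = alpha @ K @ alpha.T        # <c_k, c_m> between centroids
    H = np.outer(y_c, y_c) * G     # dual Hessian
    res = minimize(lambda b: 0.5 * b @ H @ b - b.sum(),
                   np.zeros(c),
                   jac=lambda b: H @ b - np.ones(c),
                   bounds=[(0.0, C)] * c,
                   constraints=[{'type': 'eq', 'fun': lambda b: b @ y_c}],
                   method='SLSQP')
    beta = res.x
    # recover the offset from the most interior (free) support cluster
    k = int(np.argmax(np.minimum(beta, C - beta)))
    b0 = y_c[k] - (beta * y_c) @ G[:, k]
    return beta, b0

def decision(K_test, alpha, y_c, beta, b0):
    """f(x) = sum_k beta_k y_k <c_k, phi(x)> + b, for K_test of shape (l, n_test)."""
    return (beta * y_c) @ (alpha @ K_test) + b0
```

With c fixed to a handful of clusters, the QP cost is negligible next to the clustering step, which matches the learning-time behaviour reported in Section 4.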

In order to obtain a sparse solution at the cost of the extra clustering procedure, a good selection of the number of clusters is required. In the following we consider the effect of the number of clusters c with respect to the number of training samples l:

• c = 1: The VSVC solution depends on a single soft vicinity. Similar to the one-class SVM [23, 24], the dominant class has the majority of training data inside the vicinity while the training data of the other class(es) are outside the vicinity.


• 1 < c < l: The number of clusters c controls the complexity of the VSVC. If c increases (and hence the number of soft vicinities increases), the generalisation performance of the VSVC may not be good for a simple data structure but should be suitable for more complex problems.


• c = l: Each vicinity contains one and only one data point. In this case, a standard SVM solution is obtained.

4. Experimental results


We evaluate the effectiveness of our proposed VSVC algorithm on both artificial and real medical classification problems. The VSVC implementation³ is developed based on the standard SVM MATLAB program [25].


4.1. Artificial dataset [26]

As shown in Figure 1, we consider two overlapping clusters in a two-dimensional space. Each cluster contains 200 data points. For the first cluster, the data points are uniformly located on a circle of radius 3, and normal noise with a variance of 0.2 is added to the x and y co-ordinates of the points. For the second cluster, the x and y co-ordinates follow a normal distribution with mean (0, 0) and covariance matrix (5.0, 4.9; 4.9, 5.0). The intention of this dataset is to illustrate the behaviour of the VSVC method on non-separated data, and to compare its classification performance to that obtained by a standard SVM. 200 data points are used for training and the remaining 200 samples for testing.


Figure 1: Data distribution of the artificial dataset: samples of the first cluster are represented by crosses, while circles indicate samples of the second cluster.

³ The Matlab code of SKDA is available upon request from [email protected].


As kernel function we use a Gaussian RBF

  k(x, y) = e^{−‖x−y‖² / (2σ²)}.   (40)

The best classification accuracy of the standard SVM is 94.5%, which is obtained with σ = 1 and C = 1. Employing the same σ and C parameters, the classification accuracies we obtained with our proposed VSVC based on different numbers of clusters c = 2, 3, ..., 8 are given in Table 1.

Table 1: Classification accuracies, on the artificial dataset, of a standard SVM and our proposed VSVC with different cluster numbers.

                 accuracy (%)
  SVM            94.50
  VSVC  c = 2    95.75
        c = 3    95.75
        c = 4    96.25
        c = 5    96.25
        c = 6    96.00
        c = 7    95.50
        c = 8    95.50


From Table 1 we can see that the best classification accuracy of our VSVC is 96.25% using 4 or 5 clusters, which compares favourably to the performance of the standard SVM. The separating function of the VSVC with the best classification accuracy on the training data set is plotted in Figure 2. We can observe that for this relatively simple dataset, tuning the number of clusters c does not have a significant impact on the classification performance which is in the relatively narrow range of [95.5%; 96.25%].


Figure 2: Separating function of the VSVC with c = 4.

The sparse solution of the proposed VSVC approach can be observed from the numerical comparison provided in Table 2, which shows the learning time and the number of derived support vectors for both the standard SVM and our VSVC with different cluster numbers. For VSVC, the learning time includes both the SKDA clustering time and the SVM training time. From Table 2, it is apparent that the VSVC algorithm obtains a very sparse solution where the number of support vectors is equal to the cluster number, and is thus significantly smaller than the number of support vectors employed by a standard SVM. We can also note that while the SKDA clustering time increases with larger numbers of clusters, it does so approximately in a linear fashion. Due to the much lower SVM training time, the total learning time of VSVC is much smaller than that of the standard SVM.

Table 2: Number of support vectors and total learning time for a standard SVM and the proposed VSVC with different numbers of clusters for the artificial dataset (for VSVC, the learning time is calculated as the sum of the clustering time and the training time, with the running times of the individual components given in brackets).

                 # support vectors   learning time (s)
  SVM            113                 65.0
  VSVC  c = 2    2                   8.75 (3.05+5.70)
        c = 3    3                   9.26 (3.51+5.75)
        c = 4    4                   11.90 (6.06+5.84)
        c = 5    5                   12.52 (6.61+5.91)
        c = 6    6                   13.47 (7.45+6.02)
        c = 7    7                   14.75 (8.55+6.20)
        c = 8    8                   17.83 (11.41+6.42)


4.2. Real datasets

4.2.1. Mammography dataset

We further test our proposed VSVC algorithm on a real medical dataset for which the data distribution of the mass pattern is unknown and may stem from different probability distributions for different classes. In particular, we test our approach on a dataset derived from 56 mammograms obtained from the Mammographic Image Analysis Society database [27]. Of these, 29 cases were benign, while the remaining 27 were confirmed as malignant. We utilise the mass co-ordinates provided in the dataset, and examine the image content within a small neighbourhood around the location to establish whether the detected mass is malignant or benign. As region of interest (ROI), we define a window of 2R × 2R pixels centered at the mass center, where R is the radius of the area around the mass center. From the ROI, we extract a set of statistical texture features⁴. In particular, we calculate two features from the first-order gradient distribution, the mean gradient and gradient variance, and five features – namely energy, inertia, entropy, homogeneity and correlation – from the co-occurrence matrix [28, 29] in the west-east direction with a pixel distance of 1.

⁴ The employed feature set is available upon request from [email protected]; for detailed definitions of the features we refer to [28, 29].

We employ a standard SVM and our proposed VSVC to classify the dataset into benign and malignant masses based on a leave-one-out cross validation principle, i.e., by training the classifiers on all but one sample, testing them on the remaining one, and repeating this for all samples in the dataset. As kernel, we employ again a Gaussian RBF function as in Eq. (40). The best classification accuracy achieved by the standard SVM is 82.10% with C = 10 and σ = 8. In contrast, the best classification accuracy of the proposed VSVC is 83.90% with c = 7 and using the same values of σ and C. With the parameters fixed at c = 7 and C = 10, the performance of VSVC can be further improved by tuning the kernel parameter σ, leading to an overall best classification accuracy of 85.70% for σ = 3 and hence a clearly better classification accuracy compared to the standard SVM. Due to the small size of the dataset, no significant difference was found between the computation times of the two methods.

4.2.2. Breast cancer and heart datasets

Further experiments were conducted on another two real medical datasets, namely a breast cancer⁵ and a heart (statlog) dataset from the IDA benchmark repository [30]. Both datasets are originally from the UCI benchmark repository [31]; the IDA data differs in that the data has been pre-processed and "cleaned" [6] so that (1) samples with missing input features are excluded, (2) 100 random splits into training and testing sets are defined, and (3) training and testing data have zero mean and a standard deviation of one. The IDA benchmark repository also recommends the best model (parameters) for several kernel methods including SVMs through cross-validation. In our experiments,

⁵ The breast cancer data originates from the University Medical Center, Institute of Oncology, Ljubljana.

Table 3: Information on the breast cancer and heart datasets, and the selected parameters for the classification algorithms.

                            breast cancer                   heart (statlog)
  Information
    # training samples      200                             170
    # testing samples       77                              100
    # input features        9                               13
    # classes               2 (recurrent/non-recurrent)     2 (presence/absence of disease)
  Parameters
    kernel function         Gaussian RBF                    Gaussian RBF
    σ                       5                               12
    C                       15.19                           3.16
    c (for VSVC)            5                               5
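The leave-one-out protocol used for the mammography dataset is classifier-agnostic and simple to state in code. A small harness; the `train_fn`/`predict_fn` interface is a hypothetical stand-in for any classifier, not the authors' code:

```python
import numpy as np

def leave_one_out_accuracy(X, y, train_fn, predict_fn):
    """Leave-one-out cross validation: for each sample, train on all
    other samples, predict the held-out one, and report the hit rate."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i       # everything except sample i
        model = train_fn(X[mask], y[mask])
        hits += int(predict_fn(model, X[i:i + 1])[0] == y[i])
    return hits / len(X)
```

For a dataset of 56 masses this trains 56 models, which is affordable for both classifiers and explains why no computation-time difference was observed here.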


Table 4: Classification results for the breast cancer and heart datasets.

                              breast cancer    heart (statlog)
  Accuracy (%)        SVM     72.52 ± 4.89     83.81 ± 3.46
                      VSVC    74.11 ± 4.68     84.77 ± 3.13
  # support vectors   SVM     111.5            92.4
                      VSVC    5                5
  learning time (s)   SVM     92.08            53.31
                      VSVC    32.07            25.00

we fix the number of clusters for VSVC to c = 5 so as to simplify the evaluation. Table 3 summarises the dataset characteristics and parameter settings. The classification results for both datasets are given in Table 4, which lists the (average) classification accuracy for both classifiers. As can be seen there, for the breast cancer dataset a standard SVM gives an average classification accuracy of 72.52% over the 100 defined splits. In contrast, the average classification accuracy of our proposed VSVC is higher at 74.11% (with c = 5 and using the same values of σ and C). For the statlog (heart) dataset, the average classification accuracy obtained by a standard SVM is 83.81%, while our VSVC achieves a slightly better result of 84.77%. Table 4 also provides information on the number of generated support vectors and the computational time required for learning. With c = 5, the number of support vectors for VSVC is also 5. In contrast, the standard SVM leads to a significantly larger number of support vectors, namely more than 110 for the breast cancer dataset and over 90 for the heart dataset. The average learning time of our proposed VSVC algorithm is 32.07s for the breast cancer dataset and 25.00s for the heart dataset. In comparison, the standard SVM takes significantly longer, namely 92.08s and 53.31s respectively, to train. This confirms that our proposed approach not only leads to improved classification performance but also to computationally more efficient classifiers.

5. Conclusions

In this paper, we have proposed a vicinal support vector classifier which is based on the vicinal risk minimisation principle for data classification. Our approach constructs new vicinal kernel functions by employing a supervised clustering algorithm, supervised kernel-based deterministic annealing, for training a support vector machine. The proposed approach proceeds in two phases: SKDA clustering and SVM learning. The aim of VSVC is to minimise the vicinal risk function under the constraints of respective vicinal areas defined by SKDA. Experimental results on both artificial and real medical datasets confirm that VSVC is a promising classifier that is adaptive to different data structures, and achieves better classification performance compared to a standard SVM.

VSVC obtains a very sparse solution at the cost of the extra clustering procedure. A good choice of the number of clusters can make the proposed approach more adaptive to different decision problems. While the tested range of [2, 8] should work well for many datasets, in future work we aim to investigate the automatic selection of a suitable number of clusters.
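One straightforward automatic selection scheme, given here as a hypothetical sketch (the paper leaves this question to future work), is hold-out validation over the candidate cluster counts; `fit` and `score` below are placeholders for any training routine and validation-accuracy measure.

```python
def select_num_clusters(fit, score, candidates=range(2, 9)):
    """Hold-out model selection over the cluster counts tested in the
    paper ([2, 8]): train once per candidate c, keep the best score.

    fit(c)       -> a model trained with c clusters (placeholder callable)
    score(model) -> validation accuracy of that model (placeholder callable)
    """
    best_c, best_acc = None, float("-inf")
    for c in candidates:
        acc = score(fit(c))
        if acc > best_acc:
            best_c, best_acc = c, acc
    return best_c, best_acc
```

With a VSVC training routine as `fit` and held-out accuracy as `score`, this returns the c in [2, 8] that generalises best on the validation split.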

Acknowledgments

The authors thank the anonymous reviewers for their insightful comments and valuable suggestions on earlier versions of the paper.

References

[1] B. Boser, I. Guyon, V. Vapnik, A training algorithm for optimal margin classifiers, in: 5th Annual Workshop on Computational Learning Theory, ACM Press, NY, USA, 1992, pp. 144–152.

[2] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297.

[3] E. Osuna, R. Freund, F. Girosi, Support vector machines: training and applications, Tech. rep., Massachusetts Institute of Technology, USA (1997).

[4] N. Cristianini, J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, Cambridge, UK, 2000.

[5] B. Schölkopf, A. Smola, Learning with kernels, MIT Press, MA, USA, 2002.

[6] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks 12 (2) (2001) 181–201.

[7] V. Vapnik, An overview of statistical learning theory, IEEE Transactions on Neural Networks 10 (5) (1999) 988–999.

[8] V. Vapnik, The nature of statistical learning theory, 2nd Edition, Springer Verlag, NY, USA, 2000.

[9] S. Kulkarni, G. Harman, An elementary introduction to statistical learning theory, John Wiley & Sons, NJ, USA, 2011.

[10] I. Steinwart, D. Hush, C. Scovel, Learning from dependent observations, Journal of Multivariate Analysis 100 (1) (2009) 175–194.

[11] D. Ryabko, Pattern recognition for conditionally independent data, Journal of Machine Learning Research 7 (4) (2006) 645–664.

[12] M. Vidyasagar, Learning and generalization with applications to neural networks, 2nd Edition, Springer Verlag, London, UK, 2000.

[13] O. Chapelle, J. Weston, L. Bottou, V. Vapnik, Vicinal risk minimization, in: Advances in Neural Information Processing Systems 13, MIT Press, MA, USA, 2001, pp. 416–422.

[14] A. Cao, Q. Song, X. Yang, S. Liu, C. Guo, Mammographic mass detection by vicinal support vector machine, in: IEEE International Joint Conference on Neural Networks, Vol. 3, IEEE, Budapest, Hungary, 2004, pp. 1953–1958.

[15] F. Camastra, A. Verri, A novel kernel method for clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (5) (2005) 801–805.

[16] J. Chiang, P. Hao, A new kernel-based fuzzy clustering approach: support vector clustering with cell growing, IEEE Transactions on Fuzzy Systems 11 (4) (2003) 518–527.

[17] M. Girolami, Mercer kernel-based clustering in feature space, IEEE Transactions on Neural Networks 13 (3) (2002) 780–784.

[18] J. Leski, Fuzzy c-varieties/elliptotypes clustering in reproducing kernel Hilbert space, Fuzzy Sets and Systems 141 (2) (2004) 259–280.

[19] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Tech. rep., Max Planck Institute for Biological Cybernetics, Germany (1996).

[20] K. Rose, Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, Proceedings of the IEEE 86 (11) (1998) 2210–2239.

[21] K. Rose, E. Gurewitz, G. Fox, Statistical mechanics and phase transitions in clustering, Physical Review Letters 65 (8) (1990) 945–948.

[22] X. Yang, Q. Song, W. Zhang, Kernel-based deterministic annealing algorithm for data clustering, IEE Vision, Image and Signal Processing 153 (5) (2006) 557–568.

[23] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, Estimating the support of a high-dimensional distribution, Neural Computation 13 (7) (2001) 1443–1470.

[24] D. Tax, R. Duin, Support vector data description, Machine Learning 54 (1) (2004) 45–66.

[25] S. Gunn, Support vector machines for classification and regression, Tech. rep., Image, Speech and Intelligent Systems Research Group, University of Southampton, UK (1998).

[26] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (10) (2000) 2385–2404.

[27] J. Suckling, J. Parker, D. Dance, S. Astley, I. Hutt, C. Boggis, et al., The mammographic image analysis society digital mammogram database, in: 2nd International Workshop on Digital Mammography, Elsevier Science, York, UK, 1994, pp. 375–378.

[28] R. Haralick, Statistical and structural approaches to texture, Proceedings of the IEEE 67 (5) (1979) 786–804.

[29] M. Tuceryan, A. Jain, Texture analysis, in: C. Chen, L. Pau, P. Wang (Eds.), Handbook of pattern recognition and computer vision, 2nd Edition, World Scientific Publishing Company, Singapore, 1999, pp. 207–248.

[30] G. Rätsch, IDA benchmark repository used for boosting, KFD and SVM papers. URL http://www.raetschlab.org/Members/raetsch/benchmark (Accessed: 26 December 2013)

[31] K. Bache, M. Lichman, UCI machine learning repository. URL http://archive.ics.uci.edu/ml (Accessed: December 2013)
