
Ordinal Neural Networks Without Iterative Tuning

Francisco Fernández-Navarro, Member, IEEE, Annalisa Riccardi, and Sante Carloni

Abstract— Ordinal regression (OR) is an important branch of supervised learning, in between multiclass classification and regression. In this paper, the traditional classification scheme of a neural network is adapted to learn ordinal ranks. The proposed model imposes monotonicity constraints on the weights connecting the hidden layer with the output layer. To do so, the weights are transcribed using padding variables. This reformulation leads to the so-called inequality constrained least squares (ICLS) problem. Its numerical solution can be obtained by several iterative methods, for example, trust region or line search algorithms. In this proposal, the optimum is determined analytically according to the closed-form solution of the ICLS problem estimated from the Karush–Kuhn–Tucker conditions. Furthermore, following the guidelines of the extreme learning machine framework, the weights connecting the input and the hidden layers are randomly generated, so the final model estimates all its parameters without iterative tuning. The proposed model achieves competitive performance compared with state-of-the-art neural network methods for OR.

Index Terms— Extreme learning machine (ELM), neural networks, ordinal regression (OR).

I. INTRODUCTION

LEARNING to classify or to predict numerical values from prelabeled patterns is one of the central research topics in machine learning and data mining [1]–[4]. However, less attention has been paid to ordinal regression [(OR), also called ordinal classification] problems, where the labels of the target variable exhibit a natural ordering. In contrast to regression problems, in OR the ranks are discrete and finite. These ranks are also different from the class targets in nominal classification problems due to the existence of ranking information. For example, grade labels have the ordering D ≺ C ≺ B ≺ A, where ≺ denotes the given order between the ranks. Therefore, OR is a learning problem in between regression and nominal classification. Some of the fields where OR has found application are medical research [5], [6], review ranking [7], econometric modeling [8], and sovereign credit ratings [9].

In the statistics literature, the majority of the models are based on generalized linear models [10]. The proportional odds model (POM) [10] is a well-known statistical approach for OR, which relies on a specific distributional assumption on the unobservable latent variables (generally assuming a logistic distribution) and a stochastic ordering of the input space.


OR has evolved in recent years in the machine learning field, with many achievements for the community [11], from support vector machine (SVM) approaches [12], [13] to Gaussian processes [14] and discriminant learning [15]. In the field of neural networks, Mathieson [8] proposed a model based on the POM statistical algorithm. In that work, the POM algorithm is adapted to nonlinear problems by including basis functions in the original formulation. Crammer and Singer [16] generalized the online perceptron algorithm with multiple thresholds to perform ordinal ranking. Cheng et al. [17] proposed an approach to adapt a traditional neural network to learn ordinal ranks. This proposal can be seen as a generalization of the perceptron method into multilayer perceptrons (neural networks) for OR.

Extreme learning machine (ELM) is a framework to estimate the parameters of single-layer feedforward neural networks (SLFNNs), where the hidden layer parameters do not need to be tuned but are randomly assigned [18]. ELMs have demonstrated good scalability and generalization performance with a faster learning speed when compared with other models such as SVMs and backpropagation neural networks [19]. The natural adaptation of the ELM framework to OR problems has not yet been deeply investigated. The ELM for OR (ELMOR) algorithm [20] is the first example of research in this direction. Deng et al. [20] proposed an encoding-based framework for OR, which includes three encoding schemes: a single multioutput classifier, multiple binary classifications with the one-against-all decomposition method, and the one-against-one method. Then, the parameters of the SLFNN are determined according to the proposed encoding and the traditional ELM (the input weights are assigned randomly and the output weights are estimated by solving the Moore–Penrose pseudoinverse matrix).

The main motivation of the proposed method is to provide an algorithm that is competitive, in terms of efficiency, with the state-of-the-art neural networks for OR. According to this, the proposed model is an adaptation of the ELM framework to the OR scenario. The already existing ELMOR algorithm has the limitation that the parameters connecting the hidden layer to the output layer are mutually independent and hence the monotonicity of the output values of the neural network is not guaranteed. For this reason, the model proposed in this paper imposes monotonicity constraints on the weights connecting the hidden layer with the output layer (the β parameters). To do so, a reformulation of the weights is proposed [replacing the β parameters with padding variables (Δ)]. The reformulation leads to the so-called inequality constrained least squares (ICLS) problem, since all the Δ parameters are constrained to be greater than or equal to zero.



The numerical solution of this problem can be obtained by several iterative algorithms [21]. However, in this proposal, these values are determined analytically according to the closed form of the solution obtained from the Karush–Kuhn–Tucker (KKT) conditions [22]. Moreover, following the guidelines of the ELM framework [23]–[25], the weights connecting the input and the hidden layers are randomly generated, so the final model estimates all its parameters without iterative tuning. Because of the monotonicity constraint on the β parameters, the output node vector is considered as an estimator of the cumulative probability distribution of the classes. To maintain consistency with this assumption, an OR encoding has been adopted for the values assigned to the output nodes [20].

The main advantages of the proposed model with respect to the state-of-the-art neural network models for OR are as follows.
1) The proposed model is more efficient than traditional neural network models for OR because all the parameters of the model are analytically determined instead of being iteratively tuned.
2) The model is based on a classification formulation with constraints (it is composed of J output nodes, where J is the number of classes). This allows more flexibility than traditional regression-based formulations (like the one of Mathieson [8], with one potential function and J − 1 thresholds).
3) The model is consistent in the monotonicity of the outputs, unlike the nominal classification formulations proposed in [17] and [20].

The remainder of this paper is organized as follows. A brief analysis of neural network models for OR is given in Section II. Section III describes the proposed model and the analytical estimation of its parameters. Section IV presents the experimental results. Finally, Section V summarizes the achievements and outlines some future developments of the proposed methodology.

II. NEURAL NETWORKS FOR OR

The goal of this section is to describe the main approaches available in the literature to address the OR problem from a neural network perspective. The first approach facing the ordinal problem with neural networks is the one proposed by Mathieson [8], and it addresses the OR problem from a regression perspective. For this reason, the model is composed of a potential function, f(x): R^K → R, and a threshold vector, θ ∈ R^{J−1}, where K represents the number of attributes and J the number of classes. The potential function of the model is a regression-type neural network [19], [26], [27], as the one shown in Fig. 1. In this approach, the potential function intends to uncover the nature of the assumed underlying outcome, and the threshold vector estimates the possibly different scales around different ranks. The thresholds must satisfy the constraint θ_1 < θ_2 < ··· < θ_{J−1}, and their role is to divide the real line into J contiguous intervals; these intervals map the potential function value f(x) into the corresponding discrete variable, while enforcing the ordinal constraints.
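To make the threshold mechanism of these models concrete, the short sketch below maps a latent potential value to a rank; the function name and the example thresholds are illustrative and not taken from the paper.

```python
import numpy as np

def rank_from_potential(f_x, thresholds):
    """Map a latent potential value f(x) to an ordinal rank in 1..J.

    thresholds must be strictly increasing and of length J-1; the real
    line is split into J contiguous intervals, one per rank.
    """
    thresholds = np.asarray(thresholds)
    # the number of thresholds below f(x) gives the 0-based interval index
    return int(np.sum(f_x > thresholds)) + 1

# Illustrative usage with J = 4 ranks and made-up thresholds.
print(rank_from_potential(-0.3, thresholds=[-1.0, 0.0, 1.2]))  # -> 2
```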

Fig. 1. Structure of the probabilistic neural network model.

The model is similar to the POM, but it uses a nonlinear transformation of the inputs instead of a linear one. In fact, the model can be seen as an extension of the POM that replaces the linear covariates of the POM by nonlinear basis functions. Therefore, this model will be referred to as the neural network based on the POM (NNPOM). It is equivalent to the ordinal neural network (ONN) model proposed in [28] if the activation function of the output nodes is fixed to the log-sigmoid transfer function (logsig function) and the model is trained to predict the posterior probabilities when fed with the original input variables and the variables generated by the data replication method. The predicted thresholds are the weights of the connections of the added J − 2 components.

The NNPOM model is also closely related to the perceptron learning (PRank) algorithm [16], [29]. In fact, the authors propose a perceptron online learning algorithm with the structure of threshold models. It is further extended by approximating the Bayes point [30], whereas a kernelized generalization using joint kernel functions is proposed in [31].

An extension of the PRank algorithm [29] using neural networks is introduced in [17]. The authors replace the standard softmax function of the output nodes with a sigmoidal function, because they consider the OrderedPartitions encoding instead of the traditional 1-of-J encoding (Table I). A pattern is classified with the rank corresponding to the first output node whose value is lower than a predefined threshold (in that paper, the threshold was 0.5). The algorithm is not consistent in its predictions: for example, it ignores the possibility of having an output node associated with a value higher than the threshold after the selected rank. For simplicity of notation, this model is referred to as the neural network using ordered partitions (NNOP) algorithm.
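A minimal sketch of the NNOP decision rule just described (rank of the first output node below a fixed threshold); the function name and the example values are illustrative, only the 0.5 default comes from the description above.

```python
import numpy as np

def nnop_predict(outputs, threshold=0.5):
    """Return the rank (1..J) of the first output node below the threshold.

    `outputs` holds the J output values for one pattern; if no node falls
    below the threshold, the last rank J is returned.
    """
    outputs = np.asarray(outputs)
    below = np.flatnonzero(outputs < threshold)
    return int(below[0]) + 1 if below.size > 0 else len(outputs)

# Illustrative usage: the rule ignores the later node above the threshold,
# which is the inconsistency pointed out in the text.
print(nnop_predict([0.9, 0.3, 0.8, 0.1]))  # -> 2
```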


TABLE I. ENCODING SCHEMES FOR OR

TABLE II. SUMMARY OF CHARACTERISTICS OF NEURAL NETWORK AND STATISTICAL MODELS (POM) FOR OR

ELM has also been adapted to OR [20], yielding the so-called ELMOR, and one of the proposed ordinal ELMs also considers the OrderedPartitions encoding. Additionally, multiple models are also trained using the binary decomposition and OrderedPartitions approaches. For the prediction phase, the loss-based decoding approach [32] is adopted, i.e., the selected rank is the one that minimizes the exponential loss

j* = arg min_{1 ≤ i ≤ J} d_L(M_i, f(x))    (1)

where M_i is the encoding scheme associated with class i (i.e., the ith row of the encoded matrix in Table I), f(x): R^K → R^J is the vectorial potential function (with f(x) = (f_1(x), …, f_J(x)), each component representing the output of the corresponding category), and d_L(M_i, f(x)) is the exponential loss function

d_L(M_i, f(x)) = Σ_{j=1}^{J} exp(−m_ij · f_j(x))    (2)

where [m_ij]_{i,j=1,…,J} = [M_i]_{i=1,…,J} = M ∈ R^J × R^J.

On the other hand, Costa [33] proposes a probabilistic neural network for the ordinal scenario. To adapt neural networks to the ordinal case structure, the targets are reformulated following the OneVsFollowers approach, and the prediction phase is realized considering that the output of the jth output neuron estimates the probability that the ranks j and j − 1 are both true [28], [34], [35]. From another perspective, Sánchez-Monedero et al. [36] convert the OR problem into a standard regression one using insights about the class distribution in the input space given by pairwise distance calculations. After that, the final target variable is estimated using the support vector regression method. The final classifier is called the pairwise class distances ordinal classifier. Recently, Seah et al. [37] designed ordinal classifiers in the context of transfer ordinal label learning.

In Table II, the models introduced above are categorized according to the following attributes: degree of linearity of the model, training mode (online or batch), number of outputs, and whether the monotonicity of the output values per class is guaranteed or not. As observed in Table II, the models composed of J output nodes are inconsistent in the monotonicity of the output values per class. In the proposed approach, monotonicity constraints on the parameters connecting the hidden layer to the output layer are imposed to address this issue. Therefore, in a certain form, the output of one node depends on the others. This allows modeling the output layer of the neural network as an estimation of the cumulative distribution of the outputs. Note that, because of the imposed constraints on the parameters, the model is consistent in the monotonicity of the outputs, unlike the models proposed in [17] and [20].

In the field of ensemble learning, Fernández-Navarro et al. [38] recently proposed a modified version of the negative correlation framework for OR. They study two versions of the base algorithm: the first one assumes a fixed threshold configuration, and the second one uses adaptive thresholds. Finally, Pérez-Ortiz et al. [39] also propose an ensemble threshold model based on computing different classification tasks through the formulation of different-order hypotheses. Every single model is trained to distinguish between one given class j and all the remaining ones, while grouping the latter into those classes with a rank lower than j and those with a rank higher than j. The proposed methodology can be considered as a reformulation of the well-known one-versus-all scheme.

III. PROPOSED METHOD

The main characteristics of the proposed algorithm are detailed in the following sections. First, the encoding scheme adopted and the neural network model used are presented and motivated; then the probabilistic interpretation of the outputs is discussed, followed by the description of the procedure to derive the model parameters analytically.

A. Encoding Scheme

Assume that a training sample set D = {(X, Y)} = {(x_n, y_n)}_{n=1}^{N} is available, where x_n = (x_n^(1), …, x_n^(K)) is the random vector of characteristics taking values in a subset of R^K, and the label, y_n, belongs to a finite set C = {C_1, …, C_J}. For a standard classification neural network that does not consider the order of the categories, the goal is to predict the probability that a pattern x_n belongs to category j. For simplicity of notation, from now on, if a pattern x_n belongs to a class C_j, the (C_j, j)-enumeration is used for the labels, so, in this case, y_n = C_j := j. The goal is to learn a function mapping the input vector x_n to a probability distribution vector P = (P(y_n = 1|x_n, z), P(y_n = 2|x_n, z), …, P(y_n = j|x_n, z), …, P(y_n = J|x_n, z)), where P(y_n = j|x_n, z) is close to 1 and the other elements are close to zero, subject to the constraint Σ_{j=1}^{J} P(y_n = j|x_n, z) = 1, and indicating with z the set of model parameters defined in the following section. This encoding is the so-called 1-of-J encoding (i.e., the target vector has a value of 1 in the component associated with the true class of x and 0 in the remaining components). In contrast, the proposed neural network model considers the order of the categories.


That is, there exists an order relation between these labels, such that C_1 ≺ C_2 ≺ ··· ≺ C_J, where ≺ denotes the given order between different ranks. If a pattern x_n belongs to category C_j, it is classified automatically into the lower-order categories (C_1, C_2, …, C_{j−1}) as well. Therefore, the target vector of x_n, using the ordered partitions encoding, is ŷ(x_n) = (0, 0, …, 0, 1, 1, …, 1), where ŷ_i(x_n) = 0 for all 1 ≤ i < j and ŷ_i(x_n) = 1 otherwise, as shown in Table I. The formulation of the target vector is similar to the perceptron approach [29]. It is also related to the classical cumulative probit model for OR [10], in the sense that the output node vector is considered as a certain estimation of the cumulative probability distribution of the classes (a monotonicity constraint on the weights connecting the hidden layer to the output layer is imposed). The estimation of the cumulative and the posterior probabilities is further discussed in Section III-C.
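A small sketch of how the ordered partitions targets could be built for a batch of labels; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def ordered_partition_targets(y, J):
    """Build the ordered-partitions target matrix (N x J).

    For a pattern of class j (1-based), the target row is
    (0, ..., 0, 1, ..., 1) with ones from position j onwards,
    i.e., the row estimates the cumulative indicator I(y <= i).
    """
    y = np.asarray(y).reshape(-1, 1)             # column of labels, values 1..J
    cols = np.arange(1, J + 1).reshape(1, -1)    # class indices 1..J
    return (cols >= y).astype(float)

# Illustrative usage with J = 4 classes.
print(ordered_partition_targets([1, 3, 4], J=4))
# [[1. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]]
```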

B. Neural Network Model

As previously stated, the proposed model is very similar to a standard classification neural network model, but it includes the monotonicity constraints. For this reason, the model is composed of J potential functions, f_j(x): R^K → R with j = 1, …, J, and a hidden layer (with the corresponding basis functions). The final output for each class can be described as

f_j(x_n) = Σ_{s=1}^{S} β_s^j · B_s(x_n)    (3)

where S is the number of neurons, B_s(x_n) is a nonlinear mapping from the input layer to the hidden layer (basis functions), and β^j = (β_1^j, β_2^j, …, β_S^j) is the vector of connection weights between the hidden layer and the jth output node. In this paper, the selected basis function is the sigmoidal function. Therefore

B_s(x_n) = σ( Σ_{i=1}^{K} w_si · x_n^(i) ),    σ(x) = 1 / (1 + e^{−x})    (4)

where w_s = [w_s1, w_s2, …, w_sK] is the vector of connection weights between the input layer and the sth basis function. Fig. 2 represents the structure of the proposed model. From now on, the model parameter set is defined as z = {w, β}, where w ∈ R^S × R^K and β ∈ R^S × R^J.

Fig. 2. Structure of the probabilistic neural network model proposed.

Note that the parameters connecting the hidden layer to the output layer, {β_s^j}, s = 1, …, S, j = 1, …, J, are not ordered in a nominal classification model. In this proposal, they are considered to be ordered, and the following constraint ensures their monotonicity:

β_s^j ≤ β_s^{j+1},    ∀ j = 1, …, J − 1,    ∀ s = 1, …, S.    (5)

Because the inequality (5) is defined for each pair of parameters of each basis function, the parameters of different basis functions have their own structure. The monotonicity condition can be reformulated as follows, for all the basis functions s = 1, …, S:

β_s^1 = Δ_s^1
β_s^2 = Δ_s^1 + Δ_s^2
…
β_s^J = Δ_s^1 + Δ_s^2 + ··· + Δ_s^J    (6)

subject to

Δ_s^j ≥ 0,    ∀ j = 1, …, J.

In this way, the parameters are restricted to assume positive values, β_s = (β_s^1, β_s^2, …, β_s^J) ∈ R_+^J. Equation (6) can be expressed in matrix form, for the sth basis function, as

[ β_s^1 ]   [ 1  0  ···  0 ]   [ Δ_s^1 ]
[  ···  ] = [ ··· ···     0 ] · [  ···  ]    (7)
[ β_s^J ]   [ 1  1  ···  1 ]   [ Δ_s^J ]

or, in reduced form,

β_s = C · Δ_s,    ∀ s = 1, …, S    (8)

where C ∈ R^J × R^J is the lower triangular matrix of ones and the column vector Δ_s ∈ R_+^J. Then, the matrix form considering the vectors β_s of all the basis functions is

[ β_1^1 ··· β_S^1 ]   [ 1  0  ···  0 ]   [ Δ_1^1 ··· Δ_S^1 ]
[  ···  ···  ···  ] = [ ··· ···     0 ] · [  ···  ···  ···  ]    (9)
[ β_1^J ··· β_S^J ]   [ 1  1  ···  1 ]   [ Δ_1^J ··· Δ_S^J ]

where β, Δ ∈ R_+^J × R_+^S. Finally, the complete equation in matrix form is

β = C · Δ.    (10)
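As a concrete illustration of the reparameterization in (6)–(10), the snippet below builds the lower triangular matrix C and recovers a monotone β from nonnegative padding variables Δ; the array names and the random values are illustrative.

```python
import numpy as np

J, S = 4, 5                       # number of classes and hidden neurons
C = np.tril(np.ones((J, J)))      # lower triangular matrix of ones, eq. (7)

rng = np.random.default_rng(0)
Delta = rng.random((J, S))        # nonnegative padding variables (J x S)

beta = C @ Delta                  # eq. (10): beta = C . Delta

# Each column of beta is nondecreasing across the J outputs, i.e.,
# beta[j, s] <= beta[j + 1, s], which is exactly constraint (5).
assert np.all(np.diff(beta, axis=0) >= 0)
```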

C. Probabilistic Interpretation of the Outputs

Given the accumulated behavior of the output nodes, f_j(x_n) ≤ f_{j+1}(x_n), which holds because the constraints β_s^j ≤ β_s^{j+1} are imposed in the model formulation, the cumulative probability for the jth output node is defined as

P(y_n ≤ j | x_n, z) = f_j(x_n) / f_J(x_n).    (11)

The proposed model approximates the posterior probability of a pattern belonging to category C_j as

P(y_n = j | x_n, z) = P(y_n ≤ j | x_n, z) − P(y_n ≤ j − 1 | x_n, z),    ∀ j = 2, …, J
P(y_n = 1 | x_n, z) = P(y_n ≤ 1 | x_n, z).    (12)

Finally, using the posterior probability estimation of (12), the class predicted by the model corresponds to the class whose output value is the greatest. In this way, the optimum classification rule C(x) is the following:

C(x) = l*,    where    l* = arg max_{1 ≤ l ≤ J} P(y_n = l | x, z).    (13)

D. Parameter Estimation

Recently, an efficient learning algorithm, called ELM, for SLFNNs has been proposed in [40]. The minimum of the least squares problem (minimization of the squared error function) is the solution of the following linear system:

H β^T = Ŷ    (14)

where H is the hidden layer output matrix of the SLFNN, defined as

H = (h_1, h_2, …, h_S) = [ B_1(x_1) ··· B_S(x_1) ]
                         [   ···    ···    ···   ]  ∈ R^N × R^S    (15)
                         [ B_1(x_N) ··· B_S(x_N) ]

Ŷ = (ŷ_1, ŷ_2, …, ŷ_N)^T ∈ R^N × R^J    (16)

and

β = (β_1, β_2, …, β_S) ∈ R^J × R^S.    (17)

Traditionally, ELM sets the values of the hidden layer parameters w randomly when sigmoidal nodes are considered [19], [23], [40]. In the case of radial basis function (RBF) models, the centers of each basis function are typically determined by randomly selecting patterns from the training set [25]. Despite this, in the original ELM-RBF algorithm [41], the centers of the basis functions are initialized randomly (as is the radius value) without considering the patterns of the training set. It is also important to point out that more advanced RBFs were also proposed in the ELM framework [42], [43]. For example, in the case of the modified ELM for generalized RBF (GRBF) algorithm, the centers of each GRBF were taken randomly from the patterns of the training set, and the radius and τ (i.e., the exponent of the RBF) values were determined analytically, considering that the model must fulfill two constraints: locality and coverage. Finally, the output weights are approximated as

β̂ = H† Ŷ    (18)

where H† is the pseudoinverse of H.

Similar to the ELM framework, the model proposed for OR is derived from the solution of the following linear system (with the OrderedPartitions encoding scheme), obtained from the estimation of the model parameters in (10):

Ŷ = H · β^T = H · (C · Δ)^T = H · Δ^T · C^T    (19)

subject to

Δ ≥ 0.    (20)

Equivalently, the system can be defined as

Ŷ · (C^T)^{−1} = H · Δ^T    (21)

subject to

Δ ≥ 0.    (22)

Denoting Ỹ := Ŷ · (C^T)^{−1} and by Ỹ_j the jth matrix column, the following constrained quadratic programming problems, for all j = 1, …, J, estimate the padding variables minimizing the quadratic errors:

Minimize_{Δ_j ∈ R^S}    (Ỹ_j − H · Δ_j^T)^T (Ỹ_j − H · Δ_j^T)
subject to    Δ_j ≥ 0.    (23)

This is the so-called ICLS problem. Its numerical solution can be obtained by several gradient-based algorithms, for example, by the trust-region-reflective algorithm (the lsqlin method in MATLAB) or by the active-set algorithm [21]. The main problem of these techniques is that they are iterative methods and, therefore, computationally more expensive than the methods based on ELM. To determine the β matrix analytically, a closed form for the solution of the ICLS problem is needed. This can be estimated from the KKT conditions of the convex minimization problem [22]. Fomby et al. [44] proposed a closed-form solution for the generic ICLS problem

Minimize_{β ∈ R^K}    (Y − X · β^T)^T (Y − X · β^T)
subject to    A · β ≥ 0    (24)

where A is a suitable matrix for the model and β in this case represents the weights connecting the initial covariates (X ∈ R^N × R^K) with the output (Y ∈ R^N). The solution takes the following form:

β* = β̂,                                                if A · β̂ ≥ 0
β* = β̂ + K A_2^T (A_2 K A_2^T)^{−1} (0 − A_2 β̂),        if A_2 · β̂ < 0
β* = β̂ + K A^T (A K A^T)^{−1} (0 − A β̂),                if A · β̂ < 0    (25)

where K := (X^T X)^{−1} and β̂ is the ordinary least squares (OLS) estimator

β̂ = K X^T Y = X† Y.    (26)

Besides, the submatrix A_2 is defined by the partition

A = [ A_1 ]
    [ A_2 ]    (27)

where A_1 · β̂ ≥ 0 and A_2 · β̂ < 0. Trivially, if A · β̂ ≥ 0 holds, then the solution of the OLS problem is also a solution of the ICLS one. Otherwise, according to [44], to ensure that β* is a feasible solution of the ICLS problem, the following sufficient condition must hold:

if A_2 · β̂ < 0 :    A_1 K A_2^T (A_2 K A_2^T)^{−1} (0 − A_2 β̂) ≥ 0
if A · β̂ < 0 :    [A K A^T]^{−1} (0 − A β̂) ≥ 0.    (28)

For the particular case of the constrained optimization problem defined in (23), the matrix A is equal to the identity matrix, and the solution of the jth ICLS problem reduces to

Δ*_j = Δ̂_j,                                                if Δ̂_j ≥ 0
Δ*_j = Δ̂_j + L A_2^T (A_2 L A_2^T)^{−1} (0 − A_2 Δ̂_j),      if A_2 · Δ̂_j < 0

Fig. 3. NNORELM framework.

TABLE III. CHARACTERISTICS OF THE BENCHMARK DATA SETS, ORDERED BY THE NUMBER OF CLASSES
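To tie the pieces of Section III together, the following sketch assembles a minimal NNORELM-style training and prediction routine: random input weights, the ordered-partitions targets, the transformation Ỹ = Ŷ(C^T)^{−1}, and one nonnegative least squares solve per output column. SciPy's nnls is used here as a stand-in for the closed-form KKT expression above, and all function names, the uniform weight initialization, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import nnls  # stand-in solver for the ICLS subproblems

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_nnorelm(X, y, J, S=20, seed=0):
    """Train a sketch of the proposed ordinal ELM.

    X: (N, K) inputs, y: (N,) labels in 1..J, S: hidden neurons.
    Returns the random input weights W and the output weights beta (J x S).
    """
    rng = np.random.default_rng(seed)
    N, K = X.shape
    W = rng.uniform(-1.0, 1.0, size=(S, K))        # random hidden-layer weights
    H = sigmoid(X @ W.T)                            # (N, S) hidden outputs, eq. (4)

    # Ordered-partitions targets: Y_hat[n, i] = 1 iff y_n <= i + 1.
    Y_hat = (np.arange(1, J + 1)[None, :] >= y.reshape(-1, 1)).astype(float)

    C = np.tril(np.ones((J, J)))                    # eq. (7)
    Y_tilde = Y_hat @ np.linalg.inv(C.T)            # eq. (21)

    # One nonnegative least squares problem per output column, eq. (23).
    Delta = np.stack([nnls(H, Y_tilde[:, j])[0] for j in range(J)])  # (J, S)
    beta = C @ Delta                                # eq. (10)
    return W, beta

def predict_nnorelm(X, W, beta):
    """Predict ranks with the cumulative rule of eqs. (11)-(13)."""
    F = sigmoid(X @ W.T) @ beta.T                   # (N, J) outputs f_1..f_J
    cum = F / np.maximum(F[:, [-1]], 1e-12)         # eq. (11), guard against f_J = 0
    post = np.diff(np.concatenate([np.zeros((len(X), 1)), cum], axis=1), axis=1)  # eq. (12)
    return np.argmax(post, axis=1) + 1              # eq. (13), ranks in 1..J

# Illustrative usage on synthetic data with J = 3 ordered classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = np.clip(np.digitize(X[:, 0], [-0.4, 0.4]) + 1, 1, 3)
W, beta = fit_nnorelm(X, y, J=3)
print(predict_nnorelm(X[:5], W, beta), y[:5])
```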
