
Canonical Correlation Analysis on Data With Censoring and Error Information

Jianyong Sun and Simeon Keates

Abstract— We developed a probabilistic model for canonical correlation analysis for the case when the associated datasets are incomplete. This case can arise where data entries either contain measurement errors or are censored (i.e., nonignorably missing) due to uncertainties in instrument calibration and the physical limitations of devices and experimental conditions. The aim of our model is to estimate the true correlation coefficients by eliminating the effects of measurement errors and extracting helpful information from censored data. As exact inference is not possible for the proposed model, a modified variational expectation-maximization (EM) algorithm was developed, in which the posteriors of the latent variables are approximated by normal distributions. In the experiments, the accuracy of the modified E-step approximation is first demonstrated empirically by comparison with hybrid Monte Carlo (HMC) sampling. Further experiments were carried out on synthetic datasets with different numbers of censored data points and different correlation coefficient settings to compare the proposed algorithm with a maximum a posteriori (MAP) solution and a Markov chain EM solution. Experimental results showed that the variational EM solution compares favorably against the MAP solution, approaching the accuracy of the Markov chain EM while maintaining computational simplicity. We finally applied the proposed algorithm to finding the properties of galaxy groups most correlated with X-ray luminosity.

Index Terms— Canonical correlation analysis (CCA), censored data, latent variable model, measurement errors.

Manuscript received May 16, 2012; revised February 19, 2013; accepted April 30, 2013. The work of J. Sun was supported by the National Natural Science Foundation of China under Grant 61273313. The authors are with the School of Engineering, Computing and Applied Mathematics, The University of Abertay Dundee, Dundee, U.K. (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2262949

I. INTRODUCTION

Classical canonical correlation analysis (CCA), first developed by Hotelling [25], aims to find linear correlation coefficients for datasets with two views, i.e., two sets of random variables. It has been shown to be equivalent to Fisher linear discriminant analysis (LDA) [5], [23]. In particular, it has been proven that classical CCA is equivalent to a least-squares problem, and multivariate regression can be considered a special case of CCA. In addition, under mild conditions, it can be shown that CCA is equivalent to a least-squares problem for multilabel classification [51]. To deal with various practical datasets, classical CCA has been extended over the years. For example, kernel CCA [17], [33], [34] was developed to discover nonlinear correlations among variables, and sparse CCA [21], [58], [59] was developed for datasets with a small number of data points but a large number of variables. In addition, multiple CCA [54], [59] was developed to address multiple variable sets, while supervised CCA [59] aims at variable sets with responses. Recently, a constrained CCA was developed in [41] for categorical variables, and an adaptive CCA algorithm was developed in [61] for tracking principal correlations over time. These CCA extensions are usually formulated as generalized eigenvalue problems. CCA and its variants are widely adopted for unsupervised learning, such as dimensionality reduction and visualization [3], [5], and for supervised learning, such as classification [19] and feature selection [45]. In addition, CCA and its variants have been applied to various real-world problems, such as gene expression network inference [55]–[57], cell motif identification and prediction [43], fMRI data analysis [6], [9], [29], [36], [38], video analysis [31], labor electrohysterogram analysis [22], time series analysis [15], drug screening [46], and many others.

In some real-world applications, the available datasets are not always exact and complete. For example, in scientific areas as diverse as biochemistry, meteorology, and astrophysics, some target objects are detected but carry explicit measurement errors arising from uncertainties in instrument calibration and the physical limitations of the devices and experimental conditions, while others are not detected at all (missing completely, or censored). The undetected (censored) objects are assigned detection limits on the values of their properties based on the uncertainty of the unsuccessful measurement. The presence of these uncertainties and of data censoring will certainly bias the estimation of the canonical correlation coefficients. Moreover, given the eigenvalue formulation of classical CCA and its extensions in the literature, it is rather difficult for them to deal with datasets that are censored and/or contain measurement errors [20]. A probabilistic formulation, however, has the well-known advantage of including various kinds of additional information in a principled way. The probabilistic interpretation of CCA in [5], [32] allows us to incorporate the censoring and measurement error information in a straightforward manner.

In the literature, the analysis of censored data is often called survival analysis [12], [28], which was developed to deal with death in biological organisms and failure in mechanical systems. Research on survival analysis is very rich, mostly on censored regression models, such as linear regression [13], nonlinear regression [24], [48], nonlinear mixed-effects models [10], and others. In classical statistics, errors-in-variables models also exist, such as the total least-squares approach for robust regression [27].


Probabilistic approaches able to propagate uncertainty and censoring have also started to appear recently [18], [37], [60] for certain problems, and their benefits have been convincingly demonstrated. However, to the best of our knowledge, no research has considered both censored and missing data in CCA, except [53], where missing-at-random data (but not censored data) are considered. In addition, survival analysis considers censored responses that are positive; in real datasets, such as functional genomics data [14] and gastric and lung cancer data [40], the responses can take negative values. Because of the importance of this issue in scientific data mining, we addressed this problem based on the probabilistic interpretation of CCA developed by Bach and Jordan [5]. In particular, we intended to develop a CCA algorithm for a galaxy group dataset to analyze the correlation between the characteristics of a galaxy group and its evolutionary state. The application of the developed algorithm is not limited to the galaxy group dataset; other datasets, as in genomics and cancer studies, can also be analyzed.

In the rest of this paper, Section II describes the probabilistic formulation of CCA and the models for censored data and measurement error. Section III presents a CCA model for analyzing data with measurement errors on both views and censoring in the first view, which is 1-D; a generalized EM (GEM) algorithm is also presented in this section. In Section IV, controlled experiments are carried out to study the developed model and algorithm. The application of the developed algorithm is presented in Section V. The extended model and the corresponding inference algorithm are described in Section VI. Discussion and future work are given in Section VII. Section VIII concludes this paper.

Fig. 1. Plate diagram of the probabilistic CCA (nodes: z, y1, y2).

II. PRELIMINARIES

A. Model on Censored Data

Data censoring can be categorized as left, right, and interval censoring. Interval censoring means that a datum y can be detected only when it satisfies l ≤ y ≤ u. By setting l = −∞, we obtain left-censored data (u is often called the upper limit); by setting u = +∞, we obtain right-censored data. If we introduce a response indicator R for each data point y ∈ R, where R = 1 means y is observed and R = 0 indicates the observation is missing, the log-likelihood can be written as follows (see [28]):

\[
\mathcal{L}(y, R) = \log \big[ p(R \mid y, \theta)^{1-R}\, p(y \mid \theta)^{R} \big] \tag{1}
\]

where θ is the model parameter. In the particular case of left-censored data, the probability of a datum being missing can be calculated explicitly:

\[
p(R = 0 \mid y, u) = p(y < u) = F(u) \tag{2}
\]

where F(·) is the cumulative distribution function (cdf) of y. Note that given y and u, we know whether R is zero or not.
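For illustration, the censored log-likelihood (1)–(2) can be evaluated directly when p(y | θ) is Gaussian. The following is a minimal sketch, not code from the paper; the function and variable names are our own:

```python
from scipy.stats import norm

def censored_loglik(y, R, u, mu, sigma):
    """Log-likelihood (1) of one left-censored Gaussian datum.

    y: observed value (ignored when R == 0); R: response indicator
    (1 = observed, 0 = censored); u: upper (detection) limit;
    mu, sigma: mean and standard deviation of p(y | theta).
    """
    if R == 1:
        return norm.logpdf(y, loc=mu, scale=sigma)
    # Censored case: p(R = 0 | u) = F(u), the cdf at the detection limit.
    return norm.logcdf(u, loc=mu, scale=sigma)
```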

B. Measurement Error Model

In this paper, we assumed that the measurement errors are normally distributed, as applied and justified in various applications to astronomical data (e.g., [30], [49], [50], [52]).¹ Denoting the standard deviation by S, the distribution of a data point y can be written as p(y | x) = N(y | x, S²). The unknown mean value x represents the clean, error-free version of the data.

¹This is obviously an assumption that relies on prior knowledge derived from the properties of the specific domain. Other distributional assumptions, e.g., the Weibull distribution [11], [35], [47], could be applied.

C. Probabilistic CCA (PCCA)

CCA is defined to maximize the correlation between the linear projections of the variables y1 and y2. Considering the linear projections w1ᵀy1 and w2ᵀy2, the function to be maximized in CCA is

\[
\rho = \frac{ w_1^T E(y_1 y_2^T)\, w_2 }{ \sqrt{ w_1^T E(y_1 y_1^T)\, w_1 \; w_2^T E(y_2 y_2^T)\, w_2 } }
\]

where E(·) is the expectation of the random variable. In the probabilistic interpretation of CCA for two variable sets (with dimensions d1 and d2), Bach and Jordan [5] proved that the maximum likelihood estimation (MLE) solution of the following latent variable model (assuming Gaussian distributions²) leads to CCA:

\[
z \sim \mathcal{N}(0, I_d), \quad \min\{d_1, d_2\} \ge d \ge 1
\]
\[
y_1 \mid z \sim \mathcal{N}(W_1 z + \mu_1, \Psi_1), \quad W_1 \in \mathbb{R}^{d_1 \times d}
\]
\[
y_2 \mid z \sim \mathcal{N}(W_2 z + \mu_2, \Psi_2), \quad W_2 \in \mathbb{R}^{d_2 \times d}.
\]

Fig. 1 shows the plate diagram of the model. From the figure, it can be seen that the variables y1 and y2 are conditionally independent given the latent variable z, which is modeled with a normal distribution.

²If one of the sets of input variables contains a class indicator, CCA can be applied to multiclass classification, i.e., Fisher discriminant analysis [5], [51]. In this case, the Gaussian assumption cannot be directly applied.
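As a quick illustration of the quantity being maximized (a sketch under our own naming, not code from the paper), the leading canonical correlation of two centered data matrices can be computed through the whitened cross-covariance:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

def first_canonical_correlation(Y1, Y2):
    """Leading canonical correlation of centered data matrices (N x d_i)."""
    N = Y1.shape[0]
    C11, C22 = Y1.T @ Y1 / N, Y2.T @ Y2 / N
    C12 = Y1.T @ Y2 / N
    # Singular values of the whitened cross-covariance are the canonical
    # correlations; the largest one maximizes rho over w1, w2.
    K = mpow(C11, -0.5) @ C12 @ mpow(C22, -0.5)
    return np.linalg.svd(K, compute_uv=False)[0]
```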

III. CCA FOR DATA WITH CENSORING AND MEASUREMENT ERROR

A. Proposed Model

We developed a new latent variable model for a left-censored dataset with upper limits and measurement errors. Fig. 2 shows the plate diagram of the proposed model. Without loss of generality, we first assume that the dataset with upper limits is 1-D. This is assumed particularly for the application to the galaxy group dataset described in Section V, but the model can also be applied to other datasets, such as functional genomics data [14] and gastric and lung cancer data [40]. It will be generalized to a multivariate scenario in Section VI.

Fig. 2. Directed acyclic graph representation of the proposed model (nodes: z, x1, x2, y1, y2, R, u, S, T).

In the model, the previously described models for PCCA, measurement error, and censored data are combined. From Fig. 2, it can be seen that the response indicator R also depends on the latent variable z; the limit u and the standard deviations representing the measurement errors are properly embedded in the model. In addition, the PCCA model is now assumed over the error-free data, i.e., x1 and x2, which are latent variables. Thus, it can be expected that the proposed model can capture the true correlations between the views, provided that the model assumptions are not violated and the inference for the latent variables z, x1, and x2 is accurate. Analogous to PCCA, the probability distributions associated with the arrows in the plate diagram are as follows:

\[
p(z) = \mathcal{N}(z \mid 0, 1)
\]
\[
p(R = 0 \mid u, z) = p(x_1 < u \mid z) = \int_{-\infty}^{u} p(x_1 \mid z)\, dx_1
\]
\[
p(y_1 \mid x_1) = \mathcal{N}(y_1 \mid x_1, S^2), \qquad p(y_2 \mid x_2) = \mathcal{N}(y_2 \mid x_2, T)
\]
\[
p(x_1 \mid z) = \mathcal{N}(x_1 \mid w_1 z + \mu_1, \sigma_1^2), \qquad p(x_2 \mid z) = \mathcal{N}(x_2 \mid W_2 z + \mu_2, \Sigma_2).
\]

In the model, the measurement errors are assumed to be normally distributed in p(y1 | x1) and p(y2 | x2), where x1 and x2 represent the clean, error-free version of the data. Note that we assumed the regression model for x1 and x2 rather than for the contaminated y1 and y2. We used a dotted line from x1 to R because the probability of R = 0 is implicitly independent of x1 given the upper limit u. We say "implicitly" because R equals zero if and only if x1 < u, while the probability distribution (2) is not conditioned on x1. In the rest of the section, we denote

\[
y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \quad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad
W = \begin{bmatrix} w_1 \\ W_2 \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \Sigma_2 \end{bmatrix}, \quad
M = \begin{bmatrix} S^2 & 0 \\ 0 & T \end{bmatrix}
\]

where y combines the two views, and μ, W, Σ, and M are the combinations of the parameters associated with the two views.
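To make the generative process concrete, the following sketch samples a synthetic dataset from the model. This is our own illustration, with constant measurement error scales S and T for simplicity, although the model allows per-point errors Sn and Tn:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_proposed_model(N, w1, mu1, sigma1, W2, mu2, Sigma2, S, T, u):
    """Sample y1, y2 and the response indicator R from the proposed model."""
    d2 = len(mu2)
    z = rng.normal(size=N)                           # p(z) = N(0, 1)
    x1 = w1 * z + mu1 + sigma1 * rng.normal(size=N)  # p(x1 | z)
    x2 = (np.outer(z, W2) + mu2
          + rng.multivariate_normal(np.zeros(d2), Sigma2, size=N))  # p(x2 | z)
    y1 = x1 + S * rng.normal(size=N)                 # measurement error on view 1
    y2 = x2 + rng.multivariate_normal(np.zeros(d2), T, size=N)      # ... on view 2
    R = (x1 >= u).astype(int)                        # R = 0 iff x1 < u (censored)
    y1 = np.where(R == 1, y1, u)                     # censored entries carry the limit
    return y1, y2, R
```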

Without loss of generality, we assumed that the first n0 data points in the full dataset are observed (i.e., Rn = 1) with measurement errors Sn, 1 ≤ n ≤ n0, while the rest of the dataset consists of upper limits un, n0 + 1 ≤ n ≤ N. The full data log-likelihood can be written as follows (see (1)):

\[
\mathcal{L} = \sum_{n=1}^{N} \log \int \big[ p(y_1^n \mid z_n)^{R_n}\, p(R_n \mid u_n, z_n)^{1-R_n} \big]\, p(z_n)\, p(y_2^n \mid z_n)\, dz_n
\]
\[
= \sum_{n=1}^{n_0} \log \int p(z_n)\, p(y_1^n \mid z_n)\, p(y_2^n \mid z_n)\, dz_n
+ \sum_{n=n_0+1}^{N} \log \int p(z_n)\, p(y_2^n \mid z_n)\, p(R_n = 0 \mid u_n, z_n)\, dz_n \tag{3}
\]

where p(y1n |z n ) = N (y1n |w1 z n +μ1 , Sn2 + σ12 ) and p(y2n |z n ) = N (y2n |W2 z n + μ2 , 2 + Tn ) which is a result of the convolution of two normal distributions. As there is no observable data y1n for n > n 0 , the second integration is thus independent of the censored data. B. GEM Solution 1) The Log-Likelihood Bound: As the integration in L2 (see (3)) is intractable due to p(R = 0|u n , z n ), we have to resort some approximation methods. We developed a GEM algorithm (readers are referred to [8], [42] for recent advances on GEM) with approximated variational E-step. As we know, the log-likelihood of a datum yn = [y1n y2n ]T can be lowerbounded as follows:  p(z n , yn , Rn ) dz n q(z n ) log log p(yn , Rn ) ≥ q(z n )  = q(z n ) log p(z n , yn , Rn )dz n + H (q) ≡ F (yn , Rn |q) where q(·) is the auxiliary posterior distribution, F is called free-energy, p(z n , yn , Rn ) = p(y n , Rn |z n ) p(z n ) = p(Rn |u n , z n )1−Rn p(y1n |z n ) Rn p(y2n |z n ) p(z n ) is the full likelihood, H (q) is the entropy H (q) = − q(z n ) log q(z n )dz n . In the variational E-step, we need to maximize the log-likelihood bound with regard to the auxiliary distribution q(z n ). Since in variational EM solutions, the log-likelihood bound holds for any auxiliary distribution q, here we assume q(z n ) follows a normal distribution with mean z n and variance σ 2 (z n ). To infer the posterior q(z n ), we need to evaluate the expectation of the log likelihood bound with respect to q(z n ), 1 ≤ n ≤ N. If we write n0 

F1 = log p(z n ) p(y1n |z n ) p(y2n |z n ) F2 =

n=1 N 



log p(z n ) p(Rn = 0|u n , z n ) p(y2n |z n )

n 0 +1

F3 =

N  n=1

log q(z n )

where ⟨·⟩ denotes the expectation with respect to q(zn), then the log-likelihood bound can be written as F = F1 + F2 − F3. In detail, the terms F1, F2, and F3 can be written as follows (omitting constant terms):

\[
\mathcal{F}_1 = \sum_{n=1}^{n_0} \Big[ -\tfrac{1}{2} \langle z_n^2 \rangle + \big\langle \log p(y_1^n \mid z_n) \big\rangle + \big\langle \log p(y_2^n \mid z_n) \big\rangle \Big]
\]
\[
\mathcal{F}_2 = \sum_{n=n_0+1}^{N} \Big[ -\tfrac{1}{2} \langle z_n^2 \rangle + \big\langle \log p(y_2^n \mid z_n) \big\rangle + \big\langle \log p(R_n \mid u_n, z_n) \big\rangle \Big]
\]
\[
\mathcal{F}_3 = -\sum_{n=1}^{N} \tfrac{1}{2} \log\big( \sigma^2(z_n) \big).
\]

Let μ̃1n = w1⟨zn⟩ + μ1 and μ̃2n = W2⟨zn⟩ + μ2, let Φ(·) be the standard normal cdf, and let Tr denote the trace of a matrix. We have

\[
\big\langle \log p(y_2^n \mid z_n) \big\rangle = -\tfrac{1}{2} \log |\Sigma_2 + T_n| - \tfrac{1}{2} \Delta^2_{y_2^n}
\]
\[
\big\langle \log p(y_1^n \mid z_n) \big\rangle = -\tfrac{1}{2} \log\big( \sigma_1^2 + S_n^2 \big) - \tfrac{1}{2} \Delta^2_{y_1^n}
\]

where

\[
\Delta^2_{y_2^n} = \big( y_2^n - \tilde{\mu}_2^n \big)^T [\Sigma_2 + T_n]^{-1} \big( y_2^n - \tilde{\mu}_2^n \big)
+ \mathrm{Tr}\big( [\Sigma_2 + T_n]^{-1} W_2 W_2^T\, \sigma^2(z_n) \big)
\]
\[
\Delta^2_{y_1^n} = \frac{ \big( y_1^n - \tilde{\mu}_1^n \big)^2 + w_1^2\, \sigma^2(z_n) }{ \sigma_1^2 + S_n^2 }
\]

and \( p(R_n = 0 \mid u_n, \langle z_n \rangle) = \Phi\big( (u_n - \tilde{\mu}_1^n)/\sigma_1 \big) \).

Since ⟨log p(Rn = 0 | un, zn)⟩ cannot be evaluated analytically, we approximate it at the mode of q(zn), that is,

\[
\big\langle \log p(R_n = 0 \mid u_n, z_n) \big\rangle \approx \log p(R_n = 0 \mid u_n, \langle z_n \rangle)
= \log \Phi\!\left( \frac{u_n - \tilde{\mu}_1^n}{\sigma_1} \right) \tag{4}
\]

where μ̃1n = w1⟨zn⟩ + μ1.

2) Approximated Variational E-Step: In the variational E-step, the parameters of the auxiliary distribution q are obtained by maximizing the log-likelihood bound F. Before we proceed, notice the following equality:

\[
\frac{\partial}{\partial \langle z_n \rangle} p(R_n = 0 \mid u_n, \langle z_n \rangle)
= \frac{\partial}{\partial \langle z_n \rangle} \int_{-\infty}^{u_n} p(x_1^n \mid \langle z_n \rangle)\, dx_1^n
= -w_1\, \mathcal{N}\big( u_n \mid w_1 \langle z_n \rangle + \mu_1, \sigma_1^2 \big).
\]

The objective function Fz with respect to ⟨zn⟩, 1 ≤ n ≤ n0, is as follows:

\[
\mathcal{F}_z = \sum_{n=1}^{n_0} \left[ -\frac{1}{2} \langle z_n \rangle^2
- \frac{1}{2} \frac{ \big( y_1^n - w_1 \langle z_n \rangle - \mu_1 \big)^2 }{ \sigma_1^2 + S_n^2 }
- \frac{1}{2} \big( y_2^n - \tilde{\mu}_2^n \big)^T [\Sigma_2 + T_n]^{-1} \big( y_2^n - \tilde{\mu}_2^n \big) \right].
\]

Solving the stationary equations of the log-likelihood bound with respect to ⟨zn⟩ and σ²(zn), 1 ≤ n ≤ n0, we obtain the following:

\[
\langle z_n \rangle = \frac{ W^T [\Sigma + M]^{-1} (y_n - \mu) }{ 1 + W^T [\Sigma + M]^{-1} W } \tag{5}
\]
\[
\sigma^2(z_n) = \frac{1}{ 1 + \mathrm{Tr}\big( [\Sigma + M]^{-1} W W^T \big) }. \tag{6}
\]

For n0 + 1 ≤ n ≤ N, no closed form is available for ⟨zn⟩, and the gradient of Fz with respect to ⟨zn⟩ is as follows:

\[
\frac{\partial \mathcal{F}_z}{\partial \langle z_n \rangle} = -\langle z_n \rangle
- \frac{ w_1\, \mathcal{N}\big( u_n \mid w_1 \langle z_n \rangle + \mu_1, \sigma_1^2 \big) }{ p(R_n = 0 \mid u_n, \langle z_n \rangle) }
+ W_2^T [\Sigma_2 + T_n]^{-1} \big( y_2^n - W_2 \langle z_n \rangle - \mu_2 \big).
\]

Solving the stationary equation with respect to σ²(zn), n > n0, we obtain the following:

\[
\sigma^2(z_n) = \frac{1}{ 1 + \mathrm{Tr}\big( [\Sigma_2 + T_n]^{-1} W_2 W_2^T \big) }.
\]
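As a small illustration (our own sketch, not code from the paper), the closed-form updates (5) and (6) for an observed point can be written directly in terms of the stacked parameters:

```python
import numpy as np

def estep_observed(y, W, mu, Sigma, M):
    """Posterior moments (5)-(6) of q(z_n) for an observed data point.

    W: stacked loading vector [w1; W2]; Sigma, M: diagonal model noise
    and measurement error covariance matrices for this point.
    """
    C_inv = np.linalg.inv(Sigma + M)                 # [Sigma + M]^{-1}
    mean = (W @ C_inv @ (y - mu)) / (1.0 + W @ C_inv @ W)
    var = 1.0 / (1.0 + np.trace(C_inv @ np.outer(W, W)))
    return mean, var
```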

3) Variational M-Step: In the variational M-step, we fix the auxiliary distribution and maximize F with respect to the model parameters. The objective function related to μ1, σ1, and w1 is as follows:

\[
\mathcal{F}_s = \sum_{n=1}^{n_0} \Big[ -\tfrac{1}{2} \log\big( \sigma_1^2 + S_n^2 \big) - \tfrac{1}{2} \Delta^2_{y_1^n} \Big]
+ \sum_{n=n_0+1}^{N} \log p(R_n = 0 \mid u_n, \langle z_n \rangle).
\]

No closed form is available for μ1, σ1, or w1. The gradient of Fs with respect to w1 is as follows:

\[
\frac{\partial \mathcal{F}_s}{\partial w_1}
= \sum_{n=1}^{n_0} \frac{ \big( y_1^n - w_1 \langle z_n \rangle - \mu_1 \big) \langle z_n \rangle - w_1\, \sigma^2(z_n) }{ \sigma_1^2 + S_n^2 }
- \sum_{n=n_0+1}^{N} \frac{ \langle z_n \rangle\, \mathcal{N}\big( u_n \mid w_1 \langle z_n \rangle + \mu_1, \sigma_1^2 \big) }{ p(R_n = 0 \mid u_n, \langle z_n \rangle) }.
\]

The gradient of Fs with respect to μ1 is as follows:

\[
\frac{\partial \mathcal{F}_s}{\partial \mu_1}
= \sum_{n=1}^{n_0} \frac{ y_1^n - w_1 \langle z_n \rangle - \mu_1 }{ \sigma_1^2 + S_n^2 }
- \sum_{n=n_0+1}^{N} \frac{ \mathcal{N}\big( u_n \mid w_1 \langle z_n \rangle + \mu_1, \sigma_1^2 \big) }{ p(R_n = 0 \mid u_n, \langle z_n \rangle) }.
\]

The gradient of Fs with respect to σ1 is not available in closed form for a stationary solution. To proceed, we chose to use the simplex method [16]. The gradients with respect to μ2 and W2 can be evaluated analytically. In particular, setting the gradient with respect to μ2 to zero and solving, we obtain the following:

\[
\mu_2 = \left[ \sum_{n} (\Sigma_2 + T_n)^{-1} \right]^{-1}
\left[ \sum_{n=1}^{N} (\Sigma_2 + T_n)^{-1} \big( y_2^n - W_2 \langle z_n \rangle \big) \right].
\]
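In code, this update is a pair of accumulated sums. The following is a hedged sketch with our own names; the y2 points, the posterior means, and the per-point Tn are assumed to be given:

```python
import numpy as np

def update_mu2(Y2, z_mean, W2, Sigma2, T_list):
    """Closed-form M-step update of mu_2 from the expression above."""
    A = sum(np.linalg.inv(Sigma2 + Tn) for Tn in T_list)
    b = sum(np.linalg.inv(Sigma2 + Tn) @ (y - W2 * zm)
            for y, zm, Tn in zip(Y2, z_mean, T_list))
    return np.linalg.solve(A, b)
```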

Zeroing the gradient with respect to W2 and solving, we obtain the following:

\[
W_2 = \left[ \sum_{n=1}^{N} \big( \langle z_n \rangle^2 + \sigma^2(z_n) \big)\, (\Sigma_2 + T_n)^{-1} \right]^{-1}
\left[ \sum_{n=1}^{N} (\Sigma_2 + T_n)^{-1} \big( y_2^n - \mu_2 \big) \langle z_n \rangle \right].
\]

The problem above cannot be solved for Σ2 analytically. The gradient of the function F with respect to Σ2 is the following:

\[
\frac{\partial \mathcal{F}}{\partial \Sigma_2}
= \sum_{n=1}^{N} \big[ \Delta_n + \sigma^2(z_n)\, W_2 W_2^T \big]\, (\Sigma_2 + T_n)^{-2}
\]

where \( \Delta_n = \big( y_2^n - \tilde{\mu}_2^n \big)\big( y_2^n - \tilde{\mu}_2^n \big)^T \).

4) The Correlation Coefficient: Following Archambeau et al. [2], the correlation directions can be calculated from the estimated parameters as follows:

\[
U_{1d} = \hat{\Sigma}_{11}^{-1/2}\, \hat{W}_1\, \big( I_d - B_1^{-1} \big)^{1/2} A_1, \qquad
U_{2d} = \hat{\Sigma}_{22}^{-1/2}\, \hat{W}_2\, \big( I_d - B_2^{-1} \big)^{1/2} A_2 \tag{7}
\]

where Σ̂11, Σ̂22 and Σ̂12, Σ̂21 are the estimated within- and between-covariance matrices. They can be evaluated by using the estimated parameters Ŵ1, Ŵ2, Σ̂1, and Σ̂2:

\[
B_i = \hat{W}_i^T \Sigma_i^{-1} \hat{W}_i + I_d
\]
\[
\hat{\Sigma}_{11} = \hat{W}_1 \hat{W}_1^T + \Sigma_1, \qquad
\hat{\Sigma}_{22} = \hat{W}_2 \hat{W}_2^T + \Sigma_2
\]
\[
\hat{\Sigma}_{12} = \hat{W}_1 \hat{W}_2^T, \qquad
\hat{\Sigma}_{21} = \hat{W}_2 \hat{W}_1^T.
\]

The matrix A1 contains the eigenvectors of \( (I_d - B_1^{-1})^{1/2}\, \hat{\Sigma}_{21}\, (I_d - B_2^{-1})\, (I_d - B_1^{-1})^{1/2} \), and A2 contains the eigenvectors of \( (I_d - B_2^{-1})^{1/2}\, \hat{\Sigma}_{12}\, (I_d - B_1^{-1})\, (I_d - B_2^{-1})^{1/2} \). The correlation coefficient can be calculated as follows:

\[
\rho = \frac{ U_{1d}^T\, \hat{\Sigma}_{12}\, U_{2d} }{ \sqrt{ \big( U_{1d}^T \hat{\Sigma}_{11} U_{1d} \big)\big( U_{2d}^T \hat{\Sigma}_{22} U_{2d} \big) } }. \tag{8}
\]
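Once the directions and covariance estimates are available, (8) is a direct computation. A minimal sketch for a single pair of directions (our own naming, not code from the paper):

```python
import numpy as np

def correlation_coefficient(u1, u2, S11, S12, S22):
    """Evaluate (8) for a single pair of canonical directions u1, u2."""
    num = u1 @ S12 @ u2
    den = np.sqrt((u1 @ S11 @ u1) * (u2 @ S22 @ u2))
    return num / den
```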

IV. EXPERIMENTS AND RESULTS

To verify the new algorithm, we first investigated the accuracy of the approximation used in the variational E-step. Second, we assessed the performance of the proposed algorithm experimentally by estimating the correlation coefficients of synthetic datasets, comparing the proposed algorithm with a maximum a posteriori (MAP) solution and a hybrid Monte Carlo (HMC) solution.

A. Synthetic Data

To test the performance of the new algorithm, we carried out controlled experiments on synthetic datasets. Each dataset is generated following the model shown in Fig. 2. First, we randomly generated a set of zn from the normal distribution N(0, 1), and randomly generated W ∈ R6 and μ. Sets of data points {x1n} and {x2n} ∈ R5 were created following p(x1 | z) and p(x2 | z), respectively. These data points were then contaminated by randomly generated measurement errors Sn and Tn in [0, 1]. We then selected some data points to be censored in a uniformly random manner: random positive values were added to the selected values, and the resultant values were output as upper limits. The true coefficient can be evaluated by (8).

B. Approximation Accuracy

In the variational E-step of the proposed algorithm, we assumed that q(z) is a normal distribution and approximated the expectation of log p(R | u, z) by log p(R | u, ⟨z⟩). To evaluate the accuracy of this approximation, we used a Markov chain Monte Carlo (MCMC) method to sample a set of data points from the posterior p(z | y1, y2, R). The mean and variance of these samples, together with a Monte Carlo approximation of ⟨log p(R | u, z)⟩, are then regarded as ground truth. We compared ⟨z⟩, σ²(z), and log p(R | u, ⟨z⟩) with the ground truth to show the approximation quality. Since there is no approximation for the posterior of zn, 1 ≤ n ≤ n0, we only needed to compare the approximation for zn, n0 + 1 ≤ n ≤ N. In our comparison, we used the HMC method to sample from the posterior

\[
p(z \mid y_1, y_2, R) \propto p(z)\, p(y_2 \mid z)\, p(y_1 \mid z)\, p(R \mid u, z). \tag{9}
\]

To use HMC, we needed to provide the gradient of the negative log-likelihood \( \mathcal{L}_{hmc} = \log p(z)\, p(y_2 \mid z)\, p(y_1 \mid z)\, p(R \mid u, z) \) with respect to z, which is as follows:

\[
\frac{\partial (-\mathcal{L}_{hmc})}{\partial z}
= z - W^T [\Sigma + M]^{-1} (y - Wz - \mu)
+ \frac{ w_1\, \mathcal{N}\big( u \mid w_1 z + \mu_1, \sigma_1^2 \big) }{ p(R \mid u, z) }. \tag{10}
\]
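For reference, one HMC transition for a scalar zn can be sketched as follows. This is our own illustration; `neg_loglik` and `grad` stand for −Lhmc and its derivative (10), and the step size and leapfrog count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def hmc_step(z, neg_loglik, grad, eps=0.05, n_leapfrog=20):
    """One hybrid Monte Carlo transition for a scalar latent variable."""
    p = rng.normal()                           # auxiliary momentum
    z_new, p_new = z, p - 0.5 * eps * grad(z)  # initial half step
    for _ in range(n_leapfrog):                # leapfrog integration
        z_new = z_new + eps * p_new
        p_new = p_new - eps * grad(z_new)
    p_new = p_new + 0.5 * eps * grad(z_new)    # correct the last half step
    # Metropolis acceptance test on the Hamiltonian.
    h_old = neg_loglik(z) + 0.5 * p ** 2
    h_new = neg_loglik(z_new) + 0.5 * p_new ** 2
    return z_new if rng.random() < np.exp(h_old - h_new) else z
```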

In our experiments, we sampled 5000 data points using HMC; the first 3000 samples are discarded as burn-in. Fig. 3 shows the comparison results. The number of data points in the experimental dataset is 100; the number of censored data points is 10 in the upper figures and 90 in the lower figures. In the figures, the first column compares the posterior means ⟨zn⟩ obtained by the approximation and by HMC, the second column compares the posterior variances, and the third column compares the expectations of log p(Rn | un, zn). From the figure, we can see that the compared quantities, except the posterior variances, are close to the ground truth. This indicates that the approximation employed in (4) is reasonable.

Fig. 3. Comparison of the posterior mean, variance, and likelihood between the proposed algorithm and HMC on synthetic datasets with ten upper limits (left figures) and 90 upper limits (right figures).

C. Comparison With MAP and HMC

In addition to the MLE of the proposed model, we can also estimate the model parameters by using MAP and Markov chain EM. This section compares the proposed algorithm with MAP and MCMC in terms of the estimated correlation coefficients.

1) MAP: To proceed with MAP, we need to maximize the MAP log-likelihood FMAP with respect to the model parameters. For the first n0 data points, the log-likelihood is

\[
\mathcal{F}_{MAP} = \sum_{n=1}^{n_0} \log \big[ p(z_n)\, p(y_2^n \mid z_n)\, p(y_1^n \mid z_n) \big].
\]

For the remaining data points,

\[
\mathcal{F}_{MAP} = \sum_{n=n_0+1}^{N} \log \big[ p(z_n)\, p(y_2^n \mid z_n)\, p(R_n = 0 \mid u_n, z_n) \big].
\]

For the first n0 data points, solving the stationary equation of FMAP with respect to zn, we have

\[
z_n = \frac{ W^T [\Sigma + M]^{-1} (y_n - \mu) }{ 1 + W^T [\Sigma + M]^{-1} W }.
\]

For the remaining data points, we have the following gradient:

\[
\frac{\partial \mathcal{F}_{MAP}}{\partial z_n}
= -z_n + W_2^T [\Sigma_2 + T_n]^{-1} \big( y_2^n - \tilde{\mu}_2^n \big)
- \frac{ w_1\, \mathcal{N}\big( u_n \mid w_1 z_n + \mu_1, \sigma_1^2 \big) }{ p(R_n \mid u_n, z_n) }. \tag{11}
\]

The other parameters can be solved for analytically or by gradient methods; we omit the details because of length limits.

2) HMC-EM: As the negative log-likelihood of the proposed model with respect to the latent variable z is differentiable, the HMC method should be more efficient than other MCMC methods. The gradient was already derived in (10). The HMC-EM is iterated as follows:
1) Calculate the mean and variance of q(zn), 1 ≤ n ≤ n0, by (5) and (6).
2) Sample zn, n0 + 1 ≤ n ≤ N, by using HMC.
3) Estimate the model parameters.
In our implementation, only one sample is drawn using HMC; this method is known as stochastic EM [1]. In the following experiments, 2000 iterations of HMC-EM were carried out to ensure convergence. A skeleton of this loop is sketched below.

3) Comparison Results: We compared the correlation coefficients estimated by the proposed algorithm, MAP, and HMC-EM on synthetic datasets with different numbers of detection limits and different correlation coefficient settings. Boxplots of the results over 20 runs are shown in Fig. 4. In each plot, the leftmost boxplot shows the results of HMC-EM (denoted by HMC), the central one shows the results of the proposed algorithm (denoted by GEM), and the rightmost one shows the results of MAP. To produce the plots, the sizes of the created censored datasets are all 100, and the numbers of upper limits in the three rows were 10, 50, and 90, respectively.

Fig. 4. Comparison results among the proposed algorithm (GEM), MAP, and HMC-EM. The y-axis of each plot is the correlation coefficient; on the x-axis, HMC, GEM, and MAP denote the results obtained by HMC-EM, the proposed algorithm, and MAP, respectively. The numbers of limits in the first, second, and third rows are 10, 50, and 90, respectively. The numbers in the plots are the true correlation coefficients.

As expected, we see from the figure that HMC-EM performs best, but its performance becomes gradually worse as the number of upper limits increases: with more limits, the found coefficients have larger variances and differ more from the true coefficients. This is intuitive, as the more limits there are, the more difficult it is to retrieve the true correlation coefficients. Among the three algorithms, the results obtained by MAP have large variances, and their means are far from the true coefficients; in some cases, the coefficients found by MAP range over nearly (0, 1), which indicates that MAP does not perform well. It can also be seen from the figure that the proposed algorithm is not quite as accurate as HMC-EM. However, the superiority of HMC-EM over the proposed algorithm comes at the price of a heavy computational demand and difficulties in determining its convergence. The log-likelihoods against iterations in one run on the synthetic dataset for the proposed algorithm and HMC-EM are shown in Fig. 5.

Fig. 5. Optimization process of HMC-EM (upper plot) and of the proposed algorithm (lower plot); the y-axes show the log-likelihood against the iteration count.

We can see that the likelihood bound obtained by our algorithm is close to that of HMC-EM, but our algorithm converges very fast (around 20–40 iterations) compared with around 1200 iterations for HMC-EM.

V. APPLICATION TO MEASURING CANONICAL CORRELATION IN GALAXY GROUPS

In our application, the dataset we analyzed is new to the astronomy community. Each datum in the dataset contains the information of a group of galaxies; the optical properties, dynamical properties, and X-ray luminosity are used to characterize each group. The dataset has 69 galaxy groups, and each group has 18 optical and dynamical properties plus the X-ray luminosity. That is, the number of data points is N = 69, with d1 = 1 and d2 = 18. In the dataset, 11 X-ray luminosities are left-censored, while many properties carry measurement errors.

Our hypothesis is that the sample contains galaxy groups in different evolutionary states. We can characterize the state of collapse of a particular group by looking at the dispersion in the velocities of the galaxies in that group and the spatial size of the group. Hot, X-ray emitting gas is expected in groups that have collapsed, but we wanted to test how other properties of the group, and indeed of the central galaxy in that group, are related to the evolutionary state.

We applied the developed algorithm to estimate the correlation between various subsets of the 18 property variables and the X-ray observations. We found that the properties most correlated with the X-ray luminosity do include the dispersion in the velocities of the galaxies in the group and the spatial size of the group. This observation justifies our hypothesis. In addition, the magnitude of the brightest galaxy, the number of galaxies in the group, the average distance to the central galaxy, and the mean eta parameter (this gives an idea of the morphology of the central galaxy of the group, with lower values indicating that the galaxy is more likely to be early-type rather than late-type) averaged over all galaxies in the group are also correlated with the X-ray luminosity. We found that, on average (over ten runs), the coefficient between the most correlated properties and the X-ray luminosity is 0.8315, which is much greater than 0.1551, the average of the coefficients obtained for the individual variables. Fig. 6 shows these coefficients. It can be seen from the figure that some of the variables have relatively large individual coefficients, but it is the combination of the found properties that achieves the greatest coefficient.

Fig. 6. The first bar is the correlation coefficient between the found variable set and the X-ray luminosity, while the others are the coefficients between the individual variables and the X-ray luminosity.

VI. MODEL EXTENSION TO MULTIVARIATE DATA

In the previous sections, we provided a CCA model and an algorithm for dealing with datasets that have measurement errors in both views and censoring in the first view, which is 1-D. The developed algorithm was applied particularly to the astronomical dataset. Here, we extend the model to the multivariate case. That is, we consider x1 ∈ R^{d1} and x2 ∈ R^{d2}, where missing and censored values exist in both views. Fig. 7 shows the proposed model. Here, we assumed that the covariance matrices Σ1 and Σ2 are diagonal, and let D = d1 + d2. In addition, in the rest of this paper, we denote

\[
y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \quad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \tag{12}
\]

and

\[
W = \begin{bmatrix} W_1 \\ W_2 \end{bmatrix}, \quad
R = \begin{bmatrix} R_1 \\ R_2 \end{bmatrix}, \quad
M = \begin{bmatrix} S & 0 \\ 0 & T \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \tag{13}
\]

Fig. 7. Directed acyclic graph representation of the proposed model for multivariate data (nodes: z, x1, x2, y1, y2, R1, R2, u1, u2, S, T).

where R1 = [r11, ..., r1d1]ᵀ and R2 = [r21, ..., r2d2]ᵀ are the response indicators of the latent variables x1 and x2. Since both Σ1 and Σ2 are diagonal, Σ can be written as Σ = diag[σ1, ..., σD]. In the following, we denote x = [x1 x2]ᵀ. The probability distributions appearing in the multivariate model are the same as in the previous model, except that they are multivariate. In particular, the joint probability of the indicator vector R and x, given the upper limits u = [u11, ..., u1d1, u21, ..., u2d2]ᵀ and z, can be written as in (14). The full factorization in the equation follows from the assumption that the censoring of each component is independent of the other data components:

\[
p(x, R \mid u, z) = \prod_{i=1}^{D} p(x_i, R_i \mid u_i, z)
= \prod_{i=1}^{D} p(R_i \mid u_i, z)^{1-R_i}\, p(x_i \mid z)^{R_i} \tag{14}
\]

with

\[
p(R_i = 0 \mid u_i, z) = p(x_i < u_i \mid z) = \int_{-\infty}^{u_i} p(x_i \mid z)\, dx_i
\]

where \( p(x_i \mid z_n) = \mathcal{N}(x_i \mid w_i^T z_n, \sigma_i) \), and wi is the transpose of the i-th row of W. To compute the full data log-likelihood, for each n, 1 ≤ n ≤ N, let In = {m | Rnm = 0} and Jn = {m | Rnm = 1}. The full data log-likelihood can be written as follows:

\[
\mathcal{L} = \sum_{n=1}^{N} \log \int p(x_n, R_n \mid u_n, z_n)\, p(y_n \mid x_n)\, p(z_n)\, dx_n\, dz_n
= \sum_{n=1}^{N} \log \int \Bigg[ \prod_{i \in I_n} p(R_n^i \mid u_n^i, z_n) \prod_{j \in J_n} p(y_n^j \mid z_n) \Bigg] p(z_n)\, dz_n
\]

where \( p(y_n^j \mid z_n) = \mathcal{N}(y_n^j \mid w_j^T z_n + \mu_j, \sigma_j + M_n^j) \) and \( p(y_n \mid z_n) = \mathcal{N}(y_n \mid W z_n + \mu, \Sigma + M_n) \). Correspondingly, the log-likelihood bound can be written as follows:

\[
\mathcal{F} = \sum_n \Bigg[ \sum_{i \in I_n} \big\langle \log p(R_n^i \mid u_n^i, z_n) \big\rangle
+ \sum_{j \in J_n} \big\langle \log p(y_n^j \mid z_n) \big\rangle \Bigg]
+ \sum_n \big\langle \log p(z_n) - \log q(z_n) \big\rangle.
\]
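For a given posterior mean ⟨zn⟩, the censored part of the bound is a sum of Gaussian log-cdf terms, one per censored component. A small sketch with our own naming; treating the σi as variances here is an assumption:

```python
import numpy as np
from scipy.stats import norm

def censored_bound_term(u, R, z_mean, W, mu, sigma):
    """Sum of log p(R_i = 0 | u_i, <z_n>) over censored components (cf. (14)).

    W: D x d loading matrix; sigma: length-D vector of model noise
    variances (assumed); u, R: limits and response indicators.
    """
    m = W @ z_mean + mu              # per-component means w_i^T <z_n> + mu_i
    censored = (R == 0)
    return norm.logcdf(u[censored], loc=m[censored],
                       scale=np.sqrt(sigma[censored])).sum()
```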

As in Section III-B, we used the likelihood bound to develop a GEM algorithm with an approximate E-step. In the E-step, we assumed that q(zn) is a normal distribution with mean ⟨zn⟩ and a diagonal covariance matrix Σ_{z_n} = diag(σ̂n1, ..., σ̂nd), 1 ≤ n ≤ N. Through some mathematical computation, it can be seen that only Σ_{z_n} can be obtained analytically, as follows:

\[
\hat{\sigma}_{ni} = \begin{cases} 1, & R_n^i = 0 \\[1mm]
\left( 1 + \dfrac{ W_{ii}^2 }{ \sigma_i + M_n^i } \right)^{-1}, & R_n^i = 1. \end{cases}
\]

Therefore, if Rni = 0, the posterior variance of the corresponding component remains at its prior value of 1. Letting μ̃ni = wiᵀ⟨zn⟩ + μi, the gradient of the log-likelihood bound F with respect to ⟨zn⟩ can be computed as follows:

\[
\frac{\partial \mathcal{F}}{\partial \langle z_n \rangle_i} = \begin{cases}
-\langle z_n \rangle_i + \dfrac{ W_{ii}\, \big( y_n^i - \tilde{\mu}_n^i \big) }{ \sigma_i + M_n^i }, & R_n^i = 1 \\[3mm]
-\langle z_n \rangle_i - \dfrac{ W_{ii}\, \mathcal{N}\big( u_n^i \mid w_i^T \langle z_n \rangle + \mu_i, \sigma_i \big) }{ p(R_n^i = 0 \mid u_n^i, \langle z_n \rangle) }, & R_n^i = 0. \end{cases}
\]

In the M-step, the gradients of F with respect to μ and the rows of W can be obtained as follows:

\[
\frac{\partial \mathcal{F}}{\partial \mu_i} = \sum_{n: R_n^i = 1} \frac{ y_n^i - \tilde{\mu}_n^i }{ \sigma_i + M_n^i } + A_i
\]
\[
\frac{\partial \mathcal{F}}{\partial w_i} = \sum_{n: R_n^i = 1} \frac{ \big( y_n^i - \tilde{\mu}_n^i \big) \langle z_n \rangle - \Sigma_{z_n} w_i }{ \sigma_i + M_n^i } + B_i \tag{15}
\]

where the censored-data contributions are collected in the stacked vectors [A1, ..., AD]ᵀ and [B1, ..., BD]ᵀ. In (15), Ai represents the term arising in the derivative with respect to μi. It can be written as follows:

\[
A_i = \sum_{n: R_n^i = 0} \frac{ \mathcal{N}\big( u_n^i \mid w_i^T \langle z_n \rangle + \mu_i, \sigma_i \big) }{ p(R_n^i = 0 \mid u_n^i, \langle z_n \rangle) } \tag{16}
\]


and Bi in (15) is the corresponding term in the derivative with respect to wi. It can be obtained as follows:

\[
B_i = \sum_{n: R_n^i = 0} \frac{ \langle z_n \rangle\, \mathcal{N}\big( u_n^i \mid w_i^T \langle z_n \rangle + \mu_i, \sigma_i \big) }{ p(R_n^i = 0 \mid u_n^i, \langle z_n \rangle) }. \tag{17}
\]

The gradient of F with respect to the covariance matrix Σ is not available. As in the previous algorithm, the downhill simplex method can be adopted to estimate these parameters.

We studied the performance of the developed algorithm using a dataset created following the method described in Section IV-A. The size of the dataset is 500, and the two views are {x1n ∈ R5} and {x2n ∈ R5}. Again, the measurement errors are sampled from N(0, 1). The developed algorithm was run ten times on the dataset. Fig. 8 shows the error bars of the largest coefficients obtained for the dataset with different percentages of upper limits (namely, 10%, 50%, and 90%). These coefficients are compared with the true coefficient (which is 0.75). From the figure, it can be seen that the obtained coefficients are close to the true coefficient, especially when only 10% of the data are upper limits. The variances of the results become larger as the number of upper limits increases. This is intuitive because the more values are censored, the less useful information can be extracted.

Fig. 8. Comparison between the true coefficients and the obtained coefficients with different percentages of upper limits.

VII. DISCUSSION AND FUTURE WORK

A. Model Selection

It can be seen from the log-likelihood formulation that if the number of censored data points is large, the expectation of log p(Rn | un, zn) will dominate the log-likelihood. In maximizing the likelihood, the optimization process will lead the log-likelihood function to the local optimum at σ1 = 0. To avoid this problem, we proposed putting a gamma prior G(a, 0.001) on σ1. We can determine a by using the hold-out validation method, as sketched below. To carry out the validation, we created a dataset of 150 data points {y1n, y2n, Rn}; 100 randomly selected data points are used for training and the rest for testing. After training with different settings of a, the log-likelihood of the test dataset (the out-of-sample log-likelihood, which can be calculated by performing the approximated E-step with the model parameters fixed) is used to select the optimal value of a.
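The selection loop can be sketched as follows (our own illustration; `fit_gem` and `out_of_sample_loglik` are hypothetical helpers standing for GEM training under the prior G(a, 0.001) and the fixed-parameter E-step evaluation on the test set):

```python
import numpy as np

def select_gamma_a(train, test, candidates=(0.01, 0.1, 1.0, 10.0, 50.0)):
    """Hold-out selection of the gamma prior shape a for sigma_1."""
    best_a, best_ll = None, -np.inf
    for a in candidates:
        params = fit_gem(train, prior_a=a, prior_b=0.001)  # train with G(a, 0.001)
        ll = out_of_sample_loglik(test, params)            # approximated E-step only
        if ll > best_ll:
            best_a, best_ll = a, ll
    return best_a
```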


Fig. 9 shows the boxplots of the correlation coefficients obtained using different values of a (left column) and the average out-of-sample log-likelihood (right column); the numbers of upper limits in the two rows are 10 and 90, respectively. From the figures, we see that the optimal setting for a is 1.0 in both cases, and the optimal a corresponds to an optimally recovered correlation coefficient.

Fig. 9. Hold-out validation for selecting the gamma prior parameter a. (a) and (c): boxplots of the correlation coefficients obtained against different values of a with 10 and 90 limits, respectively. (b) and (d): the respective out-of-sample log-likelihoods against different a.

B. Measurement Error Model

In this paper, the measurement error is modeled as a normal distribution. It has recently been noticed that the normality assumption may not always lead to algorithms with robust and reliable performance, particularly when the data exhibit skewness. To address this problem, skew-normal distributions [4], [39] and other distributions, such as the t-distribution [11] and the Weibull distribution [47], can be applied for modeling the measurement error. Recent research on nonlinear mixed-effects models has been carried out with the skew-normal distribution [26], [44] and with scale mixtures of skew-normal distributions [7]. These studies have shown promising results. In the future, we will adopt the skew-normal prior for the measurement error.

VIII. CONCLUSION

In this paper, we proposed a probabilistic model for canonical correlation analysis on data with censoring and measurement errors. A GEM algorithm was developed with an approximated variational E-step. Controlled experiments showed that the proposed algorithm can recover the true correlation coefficients effectively. It was also shown that the developed algorithm performs better than a MAP solution and is comparable with an HMC-EM solution while maintaining computational simplicity. We then applied the proposed algorithm to censored galaxy group data to find which properties of galaxy groups are most correlated with the X-ray luminosity. Finally, we extended the proposed model to the multivariate case, studied the model selection issue, and discussed future work on adopting various distributions for modeling the measurement error.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their helpful and constructive suggestions and comments. JS was supported by the National Natural Science Foundation of China (No. 61273313).

REFERENCES

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan, "An introduction to MCMC for machine learning," Mach. Learn., vol. 50, nos. 1–2, pp. 5–43, 2003. [2] C. Archambeau, N. Delannay, and M. Verleysen, "Robust probabilistic projections," in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 33–40. [3] H. Avron, C. Boutsidis, S. Toledo, and A. Zouzias, "Efficient dimensionality reduction for canonical correlation analysis," in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 1–22. [4] A. Azzalini, "A class of distributions which includes the normal ones," Scand. J. Stat., vol. 12, no. 2, pp. 171–178, 1985. [5] F. Bach and M. Jordan, "A probabilistic interpretation of canonical correlation analysis," Dept. Stat., Univ. California, Berkeley, CA, USA, Tech. Rep. 688, 2005. [6] A. Bruguier, K. Preuschoff, S. Quartz, and P. Bossaerts, "Investigating signal integration with canonical correlation analysis of fMRI brain activation data," NeuroImage, vol. 41, no. 1, pp. 35–44, 2008.

[7] V. G. Cancho, D. K. Dey, V. H. Lachos, and M. G. Andrade, “Bayesian nonlinear regression models with scale mixtures of skew-normal distributions: Estimation and case influence diagnostics,” Comput. Stat. Data Anal., vol. 55, no. 1, pp. 588–602, 2011. [8] S. Chatzis and Y. Demiris, “Nonparametric mixtures of Gaussian processes with power-law behavior,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 12, pp. 1862–1871, Dec. 2012. [9] N. M. Correa, T. Eichele, T. Adali, Y.-O. Li, and V. D. Calhoun, “Multiset canonical correlation analysis for the fusion of concurrent single trial ERP and functional MRI,” NeuroImage, vol. 50, no. 4, pp. 1438–1445, 2010. [10] M. Davidian and D. M. Giltinan, “Nonlinear models for repeated measurement data: An overview and update,” J. Agricult. Biol. Environ. Stat., vol. 8, no. 4, pp. 387–419, 2003. [11] M. de Castro and M. Galea, “Robust inference in an heteroscedastic measurement error model,” J. Korean Stat. Soc., vol. 39, no. 4, pp. 439–447, 2010. [12] R. C. Elandt-Johnson and N. Johnson, Survival Models and Data Analysis. New York, NY, USA: Wiley, 1999. [13] A. Eleuteri, R. Tagliaferri, L. Milano, S. D. Placido, and M. D. Laurentiis, “A novel neural network-based survival analysis model,” Neural Netw., vol. 16, no. 5, pp. 855–864, 2003. [14] A. Faisal, R. Louhimo, L. Lahti, S. Kaski, and S. Hautaniemi, “Biomarker discovery via dependency analysis of multi-view funcational genomics data,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, Dec. 2011. [15] B. Fischer, V. Roth, and J. M. Buhmann, “Time-series alignment by non-negative multiple generalized canonical correlation analysis,” BMC Bioinf., vol. 8, pp. S4–S4, Dec. 2007. [16] R. Fletcher, Practical Methods of Optimization, vol. 1. New York, NY, USA: Wiley, 1980. [17] K. Fukumizu, F. R. Bach, and A. Gretton, “Consistency of kernel canonical correlation analysis,” J. Mach. Learn. Res., vol. 8, pp. 361–383, Nov. 2006. [18] A. Girard, C. Rasmussen, J. Q. Candela, and R. Murray-Smith, “Multiple-step ahead prediction for non linear dynamic systems— A Gaussian process treatment with propagation of the uncertainty,” in Advances in Neural Information Processing Systems, vol. 15, S. Becker, S. Thrun, and K. Obermayer, Eds. Cambridge, MA, USA: MIT Press, 2003, pp. 545–552. [19] D. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural Comput., vol. 16, no. 12, pp. 2639–2664, 2004. [20] D. Hardoon and J. Shawe-Taylor, “KCCA for different level precision in content-based image retrieval,” in Proc. 3rd Int. Workshop ContentBased Multimedia Indexing, Sep. 2004. [21] D. Hardoon and J. Shawe-Taylor, “Sparse canonical correlation analysis,” Mach. Learn., vol. 83, no. 3, pp. 331–353, 2011. [22] M. Hassan, S. Boudaoud, J. Terrien, B. Karlsson, and C. Marque, “Combination of canonical correlation analysis and empirical mode decomposition applied to denoising the labor electrohysterogram,” IEEE Trans. Biomed. Eng., vol. 58, no. 9, pp. 2441–2447, Sep. 2011. [23] T. Hastie, A. Buja, and R. Tibshirani, “Penalized discriminant analysis,” Ann. Stat., vol. 23, no. 1, pp. 73–102, 1995. [24] C. Heuchenne and I. V. Keilegom, “Nonlinear regression with censored data,” Technometrics, vol. 49, no. 1, pp. 34–44, 2007. [25] H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, pp. 321–327, Dec. 1936. [26] Y. Huang and G. 
Dagne, “Simultaneous Bayesian inference for skewnormal semiparametric nonlinear mixed-effects models with covariate measurement error,” Bayesian Anal., vol. 7, no. 1, pp. 189–210, 2012. [27] S. van Huffel and P. Lemmerling, Total Least Squares and Errors-inVariables Modeling. Dordrecht, The Netherlands: Kluwer, 2002. [28] J. Ibrahim, M.-H. Chen, and D. Sinha, Bayesian Survival Analysis, 1st ed. New York, NY, USA: Springer-Verlag, 2001. [29] M. Jin, R. Nandy, and D. Cordes, “A novel test statistic for local canonical correlation analysis of fMRI data,” NeuroImage, vol. 47, pp. S123–S123, May 2009. [30] B. Kelly, “Measurement error models in astronomy,” in Statistical Challenges in Modern Astronomy. New York, NY, USA: SpringerVerlag, Dec. 2011. [31] T.-K. Kim and R. Cipolla, “Canonical correlation analysis of video volume tensors for action categorization and detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 8, pp. 1415–1428, Aug. 2009.


[32] A. Klami and S. Kaski, “Probabilistic approach to detecting dependencies between data sets,” Neurocomputing, vol. 72, nos. 1–3, pp. 39–46, 2008. [33] M. Kuss and T. Graepel, “The geometry of kernel canonical correlation analysis,” Max Planck Institute for Biological Cybernetics, Tübingen, Germany, Tech. Rep. TR-108, 2003. [34] P. Lai and C. Fyfe, “Kernel and nonlinear canonical correlation analysis,” Int. J. Neural Syst., vol. 10, no. 5, pp. 365–377, Oct. 2000. [35] E. Lee and J. Wang, Statistical Methods for Survival Data Analysis. New York, NY, USA: Wiley, 2005. [36] Y. O. Li, T. Adali, and V. D. Calhoun, “A group study of simulated driving fMRI data by multi-set canonical correlation analysis,” NeuroImage, vol. 47, pp. S124–S124, Feb. 2009. [37] X. Liu, M. Milo, N. D. Lawrence, and M. Rattray, “Probe-level measurement error improves accuracy in detecting differential gene expression,” Bioinformatics, vol. 22, no. 17, pp. 2107–2113, 2006. [38] Y. Liu, X. Liu, and Z. Su, “A new fuzzy approach for handling class labels in canonical correlation analysis,” Neurocomputing, vol. 71, no. 7, pp. 1735–1740, 2008. [39] A. O’Hagan and T. Leonhard, “Bayes estimation subject to uncertainty about parameter constraints,” Biometrika, vol. 63, no. 1, pp. 201–202, 1976. [40] M. Osman and S. Ghosh, “Nonparametric regression models for rightcensored data using Bernstein polynomials,” Comput. Stat. Data Anal., vol. 59, no. 3, pp. 559–573, Mar. 2012. [41] K. S. Park, M. S. Park, K. W. Lee, and D. Kim, “Joint use of DEA and constrained canonical correlation analysis for efficiency valuations involving categorical variables,” J. Oper. Res. Soc., vol. 60, no. 12, pp. 1775–1785, 2009. [42] A. Penalver and F. Escolano, “Entropy-based incremental variational bayes learning of Gaussian mixtures,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, pp. 534–540, Mar. 2012. [43] J.-K. Rhee, J.-G. Joung, J.-H. Chang, Z. Fei, and B.-T. Zhang, “Identification of cell cycle-related regulatory motifs using a kernel canonical correlation analysis,” BMC Genom., vol. 10, pp. S29–S29, Dec. 2009. [44] J. Rodrigues and H. Bolfarine, “Bayesian inference for an extended simple regression measurement error model using skewed priors,” Bayesian Anal., vol. 2, no. 2, pp. 349–364, 2007. [45] C. Sakar and O. Kursun, “A method for combining mutual information and canonical correlation analysis: Predictive mutual information and its use in feature selection,” Expert Syst. Appl., vol. 39, no. 3, pp. 3333–3344, 2010. [46] D. Samarov, J. Marron, Y. Liu, C. Grulke, and A. Tropsha, “Local kernel canonical correlation analysis with application to virtual drug screening,” Ann. Appl. Stat., vol. 5, no. 3, pp. 2169–2196, 2011. [47] H. Schneeweiss and H. Shalabh, “On the estimation of the linear relation when the error variances are known,” Comput. Stat. Data Anal., vol. 52, no. 2, pp. 1143–1148, 2007. [48] P. Shivaswamy, “A support vector approach to censored targets,” in Proc. IEEE Int. Conf. Data Mining, Oct. 2007, pp. 655–660. [49] J. Sun and A. Kaban, “A fast algorithm for robust mixtures in the presence of measurement errors,” IEEE Trans. Neural Netw., vol. 21, no. 8, pp. 1206–1220, Aug. 2010. [50] J. Sun, A. Kabán, and S. Raychaudhury, “Robust mixtures in the presence of measurement errors,” in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 847–854. [51] L. Sun, S. Ji, and J. Ye, “Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis,” IEEE Trans. Pattern Anal. Mach. 
Intell., vol. 33, no. 1, pp. 194–200, Jan. 2011. [52] J. Taylor, An Introduction to Error Analysis. Mill Valley, CA, USA: University Science, 1996. [53] M. van de Velden and Y. Takane, “Generalized canonical correlation analysis with missing values,” Comput. Stat. Data Anal., vol. 27, no. 3, pp. 551–571, 2012. [54] J. Vía, I. Santamaría, and J. Pérez, “A learning algorithm for adaptive canonical correlation analysis of several data sets,” Neural Netw., vol. 20, no. 1, pp. 139–152, 2007.


[55] S. Waaijenborg and A. Zwinderman, “Penalized canonical correlation analysis to quantify the association between gene expression and DNA markers,” BMC Proc., vol. 1, pp. S122–S122, Dec. 2007. [56] S. Waaijenborg and A. H. Zwinderman, “Correlating multiple SNPs and multiple disease phenotypes: Penalized non-linear canonical correlation analysis,” Bioinformatics, vol. 25, no. 21, pp. 2764–2771, 2009. [57] S. Waaijenborg and A. H. Zwinderman, “Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks,” BMC Bioinformat., vol. 10, no. 1, p. 315, 2009. [58] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis,” Biostatistics, vol. 10, no. 3, pp. 515–534, 2009. [59] D. M. Witten and R. J. Tibshirani, “Extensions of sparse canonical correlation analysis with applications to genomic data,” Stat. Appl. Genet. Molecular Biol., vol. 8, no. 1, pp. 1–27, 2009. [60] L. Wu, X. J. Hu, and H. Wu, “Joint inference for nonlinear mixed-effects models and time to event at the presence of missing data,” Biostatistics, vol. 9, no. 2, pp. 308–320, 2008. [61] F. Yger, M. Berar, G. Gasso, and A. Rakotomamonjy, “Adaptive canonical correlation analysis based on matrix manifold,” in Proc. 29th Int. Conf. Mach. Learn., 2012, pp. 1–8.

Jianyong Sun received the B.Sc. degree in computational mathematics from Xi'an Jiaotong University, Xi'an, China, and the M.Sc. degree in applied mathematics and the Ph.D. degree in computer science from the University of Essex, Colchester, U.K., in 1997, 1999, and 2005, respectively. He is currently a Lecturer with the School of Computing, Engineering and Applied Mathematics, University of Abertay Dundee, Dundee, U.K. He has published over 30 peer-reviewed papers on evolutionary algorithms and statistical machine learning in prestigious journals, such as the IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, and PNAS. His current research interests include evolutionary computation, statistical machine learning, and their applications in computational biology, bioinformatics, and image processing.

Simeon Keates received the B.A., M.A., and Ph.D. degrees in engineering from the University of Cambridge, Cambridge, U.K. He is a Chair of human-computer interaction and the Head of the School of Engineering, Computing and Applied Mathematics, University of Abertay Dundee, Dundee, U.K. He was a Royal Mail Research Fellow with the Engineering Design Centre, University of Cambridge, where he worked for six years before becoming a Research Staff Member with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, specializing in human-computer interaction and accessibility. He was with ITA Software, Cambridge, MA, USA, as a Usability Lead, before becoming an Associate Professor with the IT University of Copenhagen, Copenhagen, Denmark. He is the author or co-author of two books and more than 150 technical papers. Prof. Keates is a member of the Institution of Engineering and Technology.
