
Adaptive Quasi-Newton Algorithm for Source Extraction via CCA Approach

Wei-Tao Zhang, Shun-Tian Lou, and Da-Zheng Feng, Member, IEEE

Abstract— This paper addresses the problem of adaptive source extraction via the canonical correlation analysis (CCA) approach. Based on Liu's analysis of the CCA approach, we propose a new criterion for source extraction, which is proved to be equivalent to the CCA criterion. A fast and efficient online algorithm using quasi-Newton iteration is then developed. The stability of the algorithm is also analyzed using Lyapunov's method, which shows that the proposed algorithm asymptotically converges to the global minimum of the criterion. Simulation results are presented to verify our theoretical analysis and demonstrate the merits of the proposed algorithm in terms of convergence speed and success rate for source extraction.

Index Terms— Blind source extraction (BSE), canonical correlation analysis (CCA), Lyapunov method, Newton iteration, stability.

I. INTRODUCTION

In the past two decades, the blind separation of multiple sources from their linear mixtures has received considerable research interest. The instantaneous mixing model commonly reads

$$x(t) = A s(t), \qquad t = 1, 2, \ldots, T \tag{1}$$

where s(t) = [s_1(t), ..., s_N(t)]^T is the vector of unknown zero-mean sources, x(t) = [x_1(t), ..., x_M(t)]^T is the vector of mixed signals, A is an M × N (M ≥ N) unknown mixing matrix with full column rank, and t is the time index. The task is to recover the original sources from the mixed signals without prior knowledge of the sources or the mixing matrix, hence the term blind. Generally, there are two classes of solutions to this problem: 1) the simultaneous separation approach [1]–[8] and 2) the sequential extraction approach [9]–[21]. In the first approach, all sources are separated simultaneously up to scale and permutation ambiguities [1], whereas in the second approach, the sources are extracted one-by-one by removing the already extracted sources from their mixtures with deflation techniques. It should be emphasized that the extraction approach has some advantages over the simultaneous separation approach.

First, it is computationally cheaper when we are not interested in all of the sources [9]. Second, the simultaneous separation approach cannot work in some ill-conditioned cases, but the extraction approach can, because it requires weaker solvability conditions than simultaneous blind separation [10]. Last, the extraction approach provides more flexibility [11]: it enables extraction of the sources in a certain or prescribed order according to their statistical properties. In addition, we can extract only the source of interest if some a priori knowledge of that source can be used, by adding the associated regularization terms to the extraction criterion [12]. In this paper, we focus our attention on the extraction scheme.

A number of blind extraction algorithms have been developed in the literature. To name a few, the FastICA with deflationary orthogonalization [5] is a typical sequential extraction algorithm for source separation. Similarly, the extraction algorithms in [10] and [15] use contrasts based on higher order statistics as the driving criteria. Higher order statistics are, however, usually sensitive to outliers. An alternative is to exploit the second-order statistics of sources with temporal structures. The extraction algorithm in [9] assumes that the source can be modeled by a stable autoregressive (AR) model. The sequential diagonalization algorithm (SDA) [11] exploits the joint diagonalization structure of the source correlation matrices. The algorithms in [13] and [14] assume that the sources have very special temporal structure, and they also implicitly use the periodicity of the sources. In addition, the canonical correlation analysis (CCA) approach has also proved to be a powerful tool for source extraction [16]–[19]. Most of these algorithms, however, work in batch mode, and hence cannot be used in real-time applications. Through minimizing the normalized mean square prediction error (MSPE) of the output, several online extraction algorithms have been developed with fixed [20], [21] or optimally learned prediction coefficients [22]. These online algorithms, however, use a gradient descent technique and hence suffer from slow convergence. In addition, the convergence of the gradient learning algorithms is very sensitive to the mixing process, which leads to a lower success rate in real applications. This may result from the lack of a rigorous convergence analysis.

The objective of this paper is to present a fast and stable quasi-Newton algorithm for adaptive source extraction via the CCA approach. To develop the fast Newton algorithm, we modify the CCA criterion to obtain a new extraction criterion with the merit of having only a global minimum.


The standard Newton direction is properly modified, yielding a more robust quasi-Newton algorithm. The convergence analysis shows that the proposed algorithm asymptotically converges to the global minimum of the criterion. Compared with the gradient learning algorithms, the proposed algorithm provides faster convergence and a higher success rate.

The remainder of this paper is organized as follows. In Section II, we present a simple analysis of the CCA approach for source extraction and make an additional assumption for future use. In Section III, the quasi-Newton algorithm is derived. Section IV is devoted to the convergence analysis of the proposed algorithm. Simulation results are presented in Section V. Finally, conclusions are drawn in Section VI.

II. CCA APPROACH FOR SOURCE EXTRACTION

Let w be an M-dimensional extraction vector; then the extracted signal can be written as

$$y(t) = w^T x(t). \tag{2}$$

In the blind source extraction (BSE) context, we usually make the following assumption [19]–[23].

Assumption 1: The source signals are spatially uncorrelated but temporally colored, i.e., the source correlation matrix R_s^τ = diag[r_1(τ), ..., r_N(τ)] is diagonal, with r_i(τ) = E[s_i(t)s_i(t − τ)] ≠ 0, i ∈ I_N, for some nonzero time delay τ, where I_N denotes the integer set {1, ..., N}.

The basic idea behind the CCA approach is that the sum of any uncorrelated signals has an autocorrelation whose value is less than or equal to the maximum value of the individual signals [16]. Thus, the CCA approach maximizes the normalized time-delayed autocorrelation of the extracted signal y(t) with respect to the extraction vector w [19]

$$\max_{w} \tilde{J}(w) = \frac{E[y(t)y(t-\tau)]}{E[y^2(t)]} = \frac{w^T R_x^\tau w}{w^T R_x^0 w} \tag{3}$$

where R_x^τ = E[x(t)x^T(t − τ)] is the time-delayed correlation matrix of the mixed signals and R_x^0 = E[x(t)x^T(t)] is the covariance matrix of the mixed signals. It can be easily seen that the CCA approach belongs to the same class as the matrix pencil method [3], and the optimal solution is typically obtained by taking w to be the generalized eigenvector of the matrix pencil (R_x^τ, R_x^0) associated with the maximum generalized eigenvalue. Let (λ_i, v_i), i ∈ I_M denote the generalized eigenpairs of the matrix pencil (R_x^τ, R_x^0); then

$$R_x^\tau v_i = \lambda_i R_x^0 v_i, \quad i \in I_M \tag{4}$$

$$v_i^T R_x^\tau v_j = \lambda_i \delta_{ij}, \quad i, j \in I_M \tag{5}$$

$$v_i^T R_x^0 v_j = \delta_{ij}, \quad i, j \in I_M \tag{6}$$

where δ_ij is the Kronecker delta.
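In batch form, the CCA solution is a single call to a generalized symmetric eigensolver. The following minimal sketch (not the authors' code; the function name, the sample estimators, and the symmetrization of the sample delayed-correlation matrix are our assumptions) computes the extraction vector directly from data:

```python
import numpy as np
from scipy.linalg import eigh

def cca_extraction_vector(x, tau=1):
    """x: (M, T) array of zero-mean mixtures; returns (lambda_1, w),
    the largest generalized eigenpair of the pencil (R_x^tau, R_x^0)."""
    T = x.shape[1]
    R_0 = x @ x.T / T
    R_tau = x[:, tau:] @ x[:, :-tau].T / (T - tau)
    R_tau = 0.5 * (R_tau + R_tau.T)   # symmetrize the sample estimate
    lam, V = eigh(R_tau, R_0)         # solves R_tau v = lambda R_0 v
    return lam[-1], V[:, -1]          # eigenvalues come in ascending order
```

Because `eigh` normalizes the eigenvectors so that V^T R_x^0 V = I, the returned vector automatically satisfies the normalization (6).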

Clearly, the existence of a solution for sequentially extracting all the sources via the CCA approach can be guaranteed if the generalized eigenvalues λ_i are distinct, i.e., ∀i ≠ j, λ_i ≠ λ_j. Without loss of generality, we assume that the generalized eigenvalues are arranged in descending order λ_1 > λ_2 > ... > λ_M. Using (1), we have R_x^τ = A R_s^τ A^T and R_x^0 = A R_s^0 A^T. Interestingly, this indicates that (R_x^τ, R_x^0) and (R_s^τ, R_s^0) are congruent matrix pencils if the mixing matrix A is assumed to be nonsingular [24] with M = N. Thus, (R_x^τ, R_x^0) and (R_s^τ, R_s^0) have the same generalized eigenvalues. According to Assumption 1, the source autocorrelation matrices are diagonal, and then the generalized eigenvalues of the matrix pencil (R_s^τ, R_s^0) are the normalized time-delayed autocorrelations of the source signals

$$\lambda_i = \frac{r_{\rho(i)}(\tau)}{r_{\rho(i)}(0)}, \quad \forall i \in I_N \tag{7}$$

where ρ is a permutation on the integer set I_N.

Equation (7) physically implies that the uniqueness condition of the CCA approach for source extraction is that the source signals have different normalized spectra. This corresponds to the second-order identifiability condition found in [1] and [2]. Based on this, it is also interesting to note that for a given time lag τ, the extraction order of the sources is independent of the mixing process and depends only on the sources themselves. That is to say, the sources are recovered in a certain order. This is a very useful property in the case where only the source signal with the maximum autocorrelation is of interest.

Several online algorithms have been proposed based on the CCA criterion [19]–[23]. They use multiple time delays instead, and transform the CCA criterion into the minimization of the normalized MSPE. However, they are all gradient-type algorithms and suffer from slow convergence, since the source signals are all assumed to be temporally colored [25]. In the next section, we present a quasi-Newton algorithm to improve the convergence of the CCA-based online extraction algorithms. Before presenting our algorithm, we introduce an additional assumption below.

Assumption 2: There exists a time delay τ such that the correlation matrix R_x^τ is positive definite.

The above assumption is of great significance because it makes the proposed criterion (8) valid. Through the relation between R_x^τ and R_s^τ, Assumption 2 is actually an extension of Assumption 1: it further requires r_i(τ) > 0, i ∈ I_N. In practice, Assumption 2 can be satisfied by choosing the time delay τ close to zero lag [4].

III. FAST NEWTON ALGORITHM

In this section, we first propose a new extraction criterion. Then, a quasi-Newton algorithm is derived to extract one single source through the minimization of this criterion. Finally, a deflation procedure using the Householder transform can be adopted if multiple sources are of interest.

A. Criterion

Based on Assumption 2, we propose to consider the following cost function:

$$\min_{w} J(w) = w^T R_x^0 w - \log\left(w^T R_x^\tau w\right). \tag{8}$$

It has been shown in Section II that the optimal solution to the CCA approach is the generalized eigenvector of the matrix pencil (R_x^τ, R_x^0) associated with the maximum generalized eigenvalue.


Now, we are interested in the minimum of (8), and we want to know the following. 1) Is there a global minimum of (8)? 2) What is the relation between this minimum and the CCA solution? 3) Are there any other local minima of (8)? These questions are answered by the following results.

Proposition 1: w is a stationary point of cost function (8) iff w = ±v_i, i ∈ I_M. At each stationary point, the cost function equals 1 − log λ_i, i ∈ I_M.

Proof: See Appendix A.

Proposition 2: All stationary points of cost function (8) are saddle (unstable) points except w = ±v_1. At this point, the cost function attains the global minimum 1 − log λ_1.

Proof: See Appendix B.

These results show that the cost function J(w) has a unique global minimum and no other local minimum. Thus, global convergence can be guaranteed if one minimizes the cost function J(w) via an iterative algorithm. In addition, at the global minima, w takes the value of the generalized eigenvector of the matrix pencil (R_x^τ, R_x^0) associated with the maximum generalized eigenvalue λ_1. Therefore, the proposed criterion is equivalent to the standard CCA criterion for source extraction.

It should be pointed out that the above equivalence only means that the proposed criterion has the same global optimum as the standard CCA criterion; it does not imply that the resulting algorithms have the same convergence behavior. In fact, the proposed criterion has a landscape distinct from that of the standard one, which leads to different convergence speeds of the corresponding gradient search algorithms. Fig. 1 shows the surfaces of the two criteria for the special case where R_x^τ = diag[4, 2] and R_x^0 = diag[1, 2]. Note that the two criteria have the same global optima, located at the points w = [±1, 0]^T.

Fig. 1. Surfaces of the two criteria when R_x^τ = diag[4, 2] and R_x^0 = diag[1, 2]. (a) Standard CCA cost function. (b) Proposed cost function.

We can see from Fig. 1 that the surface of the proposed criterion is steeper than that of the standard CCA criterion around the global optima. Therefore, a gradient search algorithm based on the proposed criterion is expected to converge faster than one based on the standard CCA criterion. Besides, one sees that the proposed criterion is completely convex, whereas the standard CCA criterion is only quasiconvex: the proposed criterion is convex along both dimensions, but the standard CCA criterion is convex only along dimension w_2, whereas it is monotonically nondecreasing along dimension w_1. This distinction causes the corresponding learning algorithms to have quite different convergence behaviors. In more detail, on one hand, if w_1 is fixed, gradient learning of w_2 based on either criterion results in similar convergence because both criteria are convex along dimension w_2. On the other hand, if w_2 is fixed, gradient learning of w_1 results in completely different convergence behaviors. For the proposed criterion, the learning of w_1 converges to the stable optimum of the criterion because the criterion is convex along this dimension; this drives w toward the global optimum points (±1, 0). For the standard CCA criterion, the learning of w_1 diverges to infinity because the criterion is monotonically nondecreasing rather than convex; this drives w away from the global optimum points (±1, 0). This drawback severely affects the convergence speed of the resulting algorithm and makes the gradient algorithms sensitive to the initial value of the extraction vector. Above all, the proposed criterion enables us to develop a fast and stable quasi-Newton learning algorithm for source extraction.
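The two propositions can also be checked numerically on a synthetic positive definite pencil. The sketch below (illustrative only; the random matrices are ours) verifies that the gradient (31) vanishes at every ±v_i, that the cost there equals 1 − log λ_i, and that the smallest of these values belongs to the top eigenvector:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = 4
S = rng.standard_normal((M, M))
R_0 = S @ S.T + M * np.eye(M)            # synthetic SPD covariance
C = rng.standard_normal((M, M))
R_tau = C @ C.T + M * np.eye(M)          # synthetic SPD delayed correlation

def J(w):                                 # proposed cost (8)
    return w @ R_0 @ w - np.log(w @ R_tau @ w)

def grad(w):                              # gradient (31)
    return 2.0 * (R_0 @ w - (R_tau @ w) / (w @ R_tau @ w))

lam, V = eigh(R_tau, R_0)                 # eigh normalizes V^T R_0 V = I
for i in range(M):
    v = V[:, i]
    assert np.allclose(grad(v), 0.0, atol=1e-8)       # stationary point
    assert np.isclose(J(v), 1.0 - np.log(lam[i]))     # value 1 - log(lambda_i)
print("global minimum 1 - log(lambda_1) =", 1.0 - np.log(lam[-1]))
```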

B. Algorithm Derivation

The Newton learning rule for minimizing the cost function (8) is given by

$$H(w)\,\Delta w = -g(w) \tag{9}$$

$$w(t) = w(t-1) + \Delta w \tag{10}$$

where g(w) and H(w) are the gradient and Hessian matrix of J(w) with respect to w, evaluated in (31) and (37) of Appendixes A and B, respectively, and Δw represents the standard Newton direction. Although we show in Appendix B that the Hessian matrix H(w) is positive definite at the global minimum, there is no guarantee that H(w) is positive definite at every iteration. In such a case, a problem arises because Δw in (9) is no longer guaranteed to be a descent direction. One approach is to use the Levenberg–Marquardt modification [26] by

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. 4

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

adding a multiple of the identity matrix I to H(w), i.e., H(w) ← H(w) + εI, where ε is a damping factor. However, ε must be properly chosen such that H(w) + εI is positive definite, and this modification increases the computational load because what we need is the inverse of the Hessian matrix. We instead propose to directly drop the last term on the right-hand side of (37), which makes the approximated Hessian matrix positive definite at every iteration step

$$H(w) \approx 2\left[R_x^0 + 2\left(w^T R_x^\tau w\right)^{-2} R_x^\tau w\, w^T \left(R_x^\tau\right)^T\right]. \tag{11}$$

Note that the computation of the gradient and Hessian requires O(M²) operations. However, solving (9) requires an additional O(M³) + O(M²) operations, which is computationally demanding for an online realization of the algorithm. A more efficient and numerically more robust way is to apply the matrix inversion lemma to compute the inverse of the Hessian matrix (11). Let p = R_x^τ w; we have

$$H^{-1}(w) = \frac{1}{2}\left[\left(R_x^0\right)^{-1} - \frac{2\left(R_x^0\right)^{-1} p\, p^T \left(R_x^0\right)^{-1}}{\tilde{\lambda}^2 + 2\, p^T \left(R_x^0\right)^{-1} p}\right] \tag{12}$$

with the scalar λ̃ defined in (34). The substitution of (12) and (31) into (9) leads to the quasi-Newton direction

$$\Delta w = -H^{-1}(w)\, g(w) = -w + \frac{3\tilde{\lambda}\left(R_x^0\right)^{-1} p}{\tilde{\lambda}^2 + 2\, p^T \left(R_x^0\right)^{-1} p}. \tag{13}$$

Then, the quasi-Newton learning algorithm is given by

$$w(t) = w(t-1) + \Delta w\big|_{w = w(t-1)} = \frac{3\tilde{\lambda}(t)\left(R_x^0\right)^{-1} p(t)}{\tilde{\lambda}^2(t) + 2\, p^T(t)\left(R_x^0\right)^{-1} p(t)}. \tag{14}$$

To achieve an online realization of the algorithm, the matrices R_x^τ and R_x^0 should be replaced by their sample estimates and recursively computed as

$$R_x^\tau(t) = \beta R_x^\tau(t-1) + x(t)x^T(t-\tau)$$

$$R_x^0(t) = \beta R_x^0(t-1) + x(t)x^T(t)$$

where 0 < β ≤ 1 is the forgetting factor. Let Q = (R_x^0)^{-1}; Q can also be recursively computed using the matrix inversion lemma. The resulting quasi-Newton procedure for extracting one single source is summarized in Algorithm 1.

Algorithm 1: The quasi-Newton algorithm.

Addressing the issue of computational load, it is readily shown that the above algorithm requires 6M² + 7M multiplications per iteration, whereas the gradient descent algorithms [19]–[23] usually require (P + 3)M + P multiplications (P is the length of the predictor), which is fewer than our algorithm. The higher complexity of the quasi-Newton algorithm is, however, compensated by its faster convergence speed, as verified in our simulations.

Remark 1: Note that the initial values of Q and R_x^τ should be symmetric and positive definite, whereas w(0) can be randomly chosen. The empirical choices of these initial values are w(0) = [1, ..., 1]^T and Q(0) = R_x^τ(0) = κI, where κ is a small positive number.

Remark 2: The approximation of the Hessian matrix in (11) is of great importance because it brings two advantages: 1) it guarantees the positive definiteness of the Hessian matrix at every iteration, which is necessary to stabilize the Newton algorithm, and 2) it enables us to compute the inverse of the Hessian matrix efficiently using the matrix inversion lemma, which leads to an efficient implementation of the Newton algorithm. In addition, the analysis in Section IV shows the asymptotic convergence of the proposed algorithm to the true solution. This implies that the approximation of the Hessian matrix does not affect the convergence of the proposed algorithm.
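For concreteness, the following is our illustrative transcription of Algorithm 1 (function and variable names are ours; the paper's pseudocode box is not reproduced here): the two correlation estimates are updated recursively, Q is maintained by a rank-one inversion update, and w follows (14).

```python
import numpy as np

def quasi_newton_bse(x, tau=1, beta=1.0, kappa=1.0):
    """x: (M, T) array of mixtures; returns (w, y) with y(t) = w^T x(t)."""
    M, T = x.shape
    w = np.ones(M)                       # w(0) = [1, ..., 1]^T (Remark 1)
    R_tau = kappa * np.eye(M)            # R_x^tau(0) = kappa I
    Q = np.eye(M) / kappa                # Q(0) = (kappa I)^{-1}
    for t in range(tau, T):
        xt, xd = x[:, t], x[:, t - tau]
        R_tau = beta * R_tau + np.outer(xt, xd)        # recursive R_x^tau
        Qx = Q @ xt                                    # rank-one update of Q
        Q = (Q - np.outer(Qx, Qx) / (beta + xt @ Qx)) / beta
        p = R_tau @ w                                  # p = R_x^tau w
        lam = w @ p                                    # lambda_tilde, cf. (34)
        Qp = Q @ p
        w = 3.0 * lam * Qp / (lam ** 2 + 2.0 * p @ Qp)  # update (14)
    return w, w @ x
```

After one pass over the mixtures, the deflation step of Section III-C below can be applied to extract the next source.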

C. Deflation Procedure

If all the sources have to be recovered, one can sequentially extract the remaining sources using the deflation technique. Assume that one source has been successfully extracted as s_i(t) = w^T x(t), i ∈ I_N; then we can use this source to estimate its corresponding steering vector a_i (the i-th column of A). This can be done by minimizing

$$\min_{a_i} f(a_i) = E\left\|x(t) - a_i s_i(t)\right\|^2. \tag{15}$$

The solution is straightforward

$$a_i^{\mathrm{opt}} = \frac{E[s_i(t)\, x(t)]}{E[s_i^2(t)]}. \tag{16}$$

Using the finite samples, we can write the estimate of a_i^opt as

$$\hat{a}_i = \left(s_i^T s_i\right)^{-1} X s_i \tag{17}$$

with the data matrix X and data vector s_i defined by

$$X = [x(1), \ldots, x(T)], \qquad s_i = [s_i(1), \ldots, s_i(T)]^T. \tag{18}$$

Now, we can construct an M × (M − 1) matrix B = [b_1, ..., b_{M−1}] to remove the extracted source s_i from the mixtures

$$x'(t) = B^T x(t) = B^T A s(t) \tag{19}$$

where the columns of B are chosen to be independent of each other and orthogonal to a_i, i.e., b_j ⊥ a_i, j ∈ I_{M−1}. Thus, the i-th column of B^T A is completely zero, and hence x'(t) is a mixture of s(t) that does not contain the already extracted source s_i. Then, we can extract another source from x'(t) by applying the proposed algorithm once more.


Note that the choice of B is not unique; we propose to use the Householder transform [24] to construct the matrix B. Let G = [(a_i/‖a_i‖), B] be an orthonormal matrix with first column a_i/‖a_i‖; then G can be constructed as

$$g = a_i + \mathrm{sgn}(a_1)\|a_i\|\, e_1 \tag{20}$$

$$G = I - 2\left(g^T g\right)^{-1} g g^T \tag{21}$$

where sgn(a_1) is the sign of the first element of a_i and e_1 is the first column of I.

Remark 3: The independence of the columns of B must be guaranteed because this ensures that the new mixing matrix is still of full column rank for the remaining sources. It should be pointed out that the Householder reflector matrix G defined above guarantees that the columns of B are orthogonal to each other, which will further improve the conditioning of the original mixing.
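A full deflation round combines the steering-vector estimate (17), the Householder construction (20) and (21), and the projection (19). A minimal sketch (ours, not the authors' code) follows:

```python
import numpy as np

def deflate(X, s_i):
    """X: (M, T) mixtures; s_i: (T,) extracted source estimate.
    Returns the deflated (M - 1, T) mixtures x'(t) = B^T x(t)."""
    a_hat = (X @ s_i) / (s_i @ s_i)                  # steering vector, (17)
    sgn = 1.0 if a_hat[0] >= 0 else -1.0             # sign convention of (20)
    g = a_hat.copy()
    g[0] += sgn * np.linalg.norm(a_hat)              # g = a_i + sgn(a_1)||a_i|| e_1
    G = np.eye(len(g)) - 2.0 * np.outer(g, g) / (g @ g)  # reflector (21)
    B = G[:, 1:]            # remaining columns are orthogonal to a_hat
    return B.T @ X          # (19): the extracted source is removed
```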

IV. ASYMPTOTIC CONVERGENCE

In this section, the convergence analysis of the proposed algorithm is presented. This can be done by analyzing the corresponding continuous-time ordinary differential equation (ODE) using Lyapunov's second stability theorem. For the proposed algorithm, the ODE is given by

$$\dot{w} = \frac{dw(t)}{dt} = -w(t) + \frac{3\tilde{\lambda}(t)\left(R_x^0\right)^{-1} R_x^\tau w(t)}{\tilde{\lambda}^2(t) + 2\, w^T(t)\, R_x^\tau \left(R_x^0\right)^{-1} R_x^\tau w(t)}. \tag{22}$$

To establish our main result, we first state a lemma.

Lemma 1: For each generalized eigenvector v_l, l ∈ I_M of the matrix pencil (R_x^τ, R_x^0), the sets {w | v_l^T R_x^0 w = 0} and {w | v_l^T R_x^0 w ≠ 0} are invariant sets for the ODE (22).

Proof: Because the generalized eigenvector matrix V is invertible, we can always write w as a linear combination of the v_i with coefficients α_i, i ∈ I_M

$$w(t) = \sum_{i=1}^{M} \alpha_i(t)\, v_i. \tag{23}$$

It can be seen that

$$v_l^T R_x^0 w(t) = \sum_{i=1}^{M} \alpha_i(t)\, v_l^T R_x^0 v_i = \alpha_l(t). \tag{24}$$

Then, we have

$$\dot{\alpha}_l(t) = v_l^T R_x^0 \dot{w}(t) = -v_l^T R_x^0 w(t) + \frac{3\tilde{\lambda}(t)\, v_l^T R_x^\tau w(t)}{\tilde{\lambda}^2(t) + 2\, w^T(t)\, R_x^\tau \left(R_x^0\right)^{-1} R_x^\tau w(t)} \tag{25}$$

with λ̃(t) = Σ_{i=1}^{M} λ_i α_i²(t). Using (39), one obtains

$$R_x^\tau \left(R_x^0\right)^{-1} R_x^\tau = V^{-T} D^2 V^{-1}. \tag{26}$$

Straightforward algebraic manipulations lead to

$$\dot{\alpha}_l(t) = \theta_l(t)\, \alpha_l(t) \tag{27}$$

where the scalar θ_l is defined as

$$\theta_l(t) = \frac{3\tilde{\lambda}(t)\lambda_l}{\tilde{\lambda}^2(t) + 2\sum_{i=1}^{M} \lambda_i^2 \alpha_i^2(t)} - 1. \tag{28}$$

It can be easily seen from (27) that α_l = 0 is an equilibrium state for the dynamic system modeled by (27); that is to say, α_l(0) = 0 implies α_l(t) = 0 for all t. Conversely, α_l(t) ≠ 0 iff α_l(0) ≠ 0. Thus, Lemma 1 is proved.

Lemma 1 means that any trajectory w(t) starting inside one of the above sets remains there for all subsequent times. Based on this lemma, we present our main result for the convergence of the proposed algorithm as follows.

Proposition 3: Given the positive definite matrix pencil (R_x^τ, R_x^0) with generalized eigenvalues λ_1 > λ_2 > ... > λ_M, w = ±v_1 is an asymptotically stable equilibrium point for the ODE (22) with domain of attraction {w | v_1^T R_x^0 w ≠ 0}.

Proof: See Appendix C.

Based on the above proposition, the asymptotic convergence of the proposed algorithm can be established as follows.

Corollary 1: Let β = 1 for the proposed algorithm. If w(t) visits the domain of attraction infinitely often with probability one, then w(t) asymptotically converges to ±v_1 with probability one as t → ∞.

This result follows immediately from Proposition 3 and Ljung's theorem [28].

V. SIMULATION RESULTS

In this section, some simulation results are presented to demonstrate the performance of the proposed quasi-Newton algorithm. The first example serves to show the transient behavior and stable convergence of the proposed algorithm for extracting one single AR source. In the second example, we investigate the steady-state performance of the proposed algorithm for sequential blind extraction of all sources with the deflation technique. The last example is devoted to an experiment on the extraction of real-world electrocardiogram (ECG) data. Liu's [19], [23] and Cichocki's [22] gradient descent algorithms are used for comparison.

To evaluate the performance of the blind extraction algorithms, we use the interference-to-signal ratio (ISR) as our performance measure, which is defined as [19]

$$\mathrm{ISR} = 10\log_{10}\left[\frac{1}{N-1}\left(\frac{\sum_{i=1}^{N} c_i^2}{\max_i\{c_i^2\}} - 1\right)\right] \tag{29}$$

where c = w^T A = [c_1, ..., c_N] is the global mixing-separating vector. Clearly, if the vector c has one and only one nonzero entry, then the corresponding algorithm achieves perfect extraction of one source. It is worth noting that if the deflation procedure is used to sequentially extract all the sources, then the vector c should be calculated using the associated new mixing matrix instead.
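In code, the ISR measure (29) is a one-liner on the global mixing-separating vector; the sketch below assumes the true mixing matrix A is known, as it is in simulations:

```python
import numpy as np

def isr_db(w, A):
    """ISR of (29): w is the extraction vector, A the mixing matrix."""
    c = w @ A                              # global mixing-separating vector
    c2 = c ** 2
    return 10.0 * np.log10((c2.sum() / c2.max() - 1.0) / (c2.size - 1))
```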

A. Extraction of One Single Source: Transient Behavior and Success Rate

In this example, the sources are N = 3 temporally colored Gaussian signals, which were generated by three first-order


AR processes s_i(t) = α_i s_i(t − 1) + e_i(t), i = 1, 2, 3, where α_i is the AR coefficient, drawn from a uniform distribution on (0, 1), and e_i(t) is a zero-mean white Gaussian driving sequence with variance 1 − α_i² such that the resulting source signal has unit variance. M = 4 mixtures are observed against a white Gaussian noise background with variance σ²; the signal-to-noise ratio (SNR) is then defined as 10 log10(σ⁻²). The mixing matrix A is randomly generated, with entries drawn independently from a standard normal distribution.

The settings for the extraction algorithms are as follows. The gradient algorithms based on linear prediction work in a similar manner; the difference is that Liu's algorithm uses a fixed predictor, whereas Cichocki's algorithm uses a learning one. To perform a meaningful comparison, it is important to make sure that the competitors extract the same source signal. For the proposed algorithm, the extraction order depends only on the sources themselves. From (7), the first extracted signal will be the one with the maximum autocorrelation. It can be readily shown that the autocorrelations of the sources are r_i(τ) = α_i^τ, i = 1, 2, 3. Therefore, the source signal generated with α_m = max{α_1, α_2, α_3} (we call it the desired source) will be extracted by the proposed algorithm. To extract the same source signal, we use the optimal prediction filter with coefficients [1, −α_m] for Liu's algorithm, and Cichocki's algorithm uses these coefficients as the initial value of its prediction filter. Note that the optimal filter is actually the whitening filter for the desired source, which guarantees that the desired source has the minimum MSPE. Both of the gradient algorithms need to recursively estimate the power of the extracted signal; the corresponding forgetting factor β_y is set to 0.99. Another important parameter for the gradient algorithms is the step size μ. It is well known that this parameter represents the tradeoff between convergence speed and steady-state accuracy. To obtain better performance, we use an exponentially decreasing step size μ(t) = 0.002 × 0.9998^t for the gradient algorithms, which is also used for the learning of the predictor in Cichocki's algorithm. Note that Cichocki's algorithm requires prewhitening of the observations; this is done by standard principal component analysis (PCA) [2]. Because the prewhitening procedure can practically improve the robustness of the blind extraction algorithm, Liu's algorithm with prewhitening [23] is also considered in the comparison.

For the proposed algorithm, we simply use the time delay τ = 1. Note that the forgetting factor β affects the performance of our algorithm; it is usually chosen close to 1 in learning algorithms to balance tracking capability and final accuracy. Without loss of generality, we empirically set β = 1 and use the initialization in Remark 1 with κ = 1. All competitors adopt the same initial extraction vector w(0) = [1, ..., 1]^T.
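The data generation for this example can be reproduced as follows (a sketch under the stated settings; the random seed and the sample length T are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)             # seed is our choice
N, M, T, snr_db = 3, 4, 5000, 20.0
alpha = rng.uniform(0.0, 1.0, size=N)      # AR coefficients in (0, 1)
e = rng.standard_normal((N, T)) * np.sqrt(1.0 - alpha ** 2)[:, None]
s = np.zeros((N, T))
for t in range(1, T):
    s[:, t] = alpha * s[:, t - 1] + e[:, t]   # s_i(t) = a_i s_i(t-1) + e_i(t)
A = rng.standard_normal((M, N))               # random mixing matrix
sigma = 10.0 ** (-snr_db / 20.0)              # SNR = 10 log10(sigma^-2)
x = A @ s + sigma * rng.standard_normal((M, T))
```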

Fig. 2. Average ISR of four algorithms for extracting one single source. Each mixture is polluted by additive white Gaussian noise with SNR = 20 dB.

TABLE I. Running time T_C of overall iterations and success rate corresponding to Fig. 2.

Fig. 2 shows the evolution of the averaged ISR versus the iteration number, where we perform 500 independent trials. From Fig. 2, we observe that the gradient learning algorithms behave similarly and converge at a linear rate. In contrast, the proposed algorithm converges much faster than the gradient ones.

This is because the proposed algorithm uses a second-order Newton iteration, which commonly exhibits a quadratic convergence rate in the neighborhood of the solution. Besides convergence speed, actual running time is also compared and listed in the middle row of Table I. One sees that the proposed algorithm consumes slightly more time for the overall iterations. However, our algorithm converges within about one third of the overall iterations, whereas the gradient algorithms need about two thirds of the overall iterations. Therefore, the proposed algorithm is more efficient.

We have also observed that, in individual runs, the gradient algorithms occasionally produce unacceptable solutions. This is because they are prone to diverge or to converge to a solution with poor ISR performance, which actually corresponds to a failed extraction. We have to point out that the failed extractions of each algorithm have been excluded from the average in Fig. 2. The corresponding success rate is listed in the last row of Table I. We see that the proposed algorithm achieves a higher success rate than the gradient algorithms. Therefore, it is necessary to further investigate the success rate of the extraction algorithms. A trial is said to be successful if the ISR learning curve does not diverge and converges to an acceptable solution with reasonable ISR performance. To this end, an ISR threshold should be chosen empirically; the choice of this threshold depends mainly on the SNR, provided that the extraction algorithms converge. Fig. 3 shows the success rate of each algorithm as the SNR varies, where the source signals are three first-order AR processes with α_1 = 0.9, α_2 = 0.55, α_3 = 0.3 fixed, and the ISR threshold is chosen as a function of SNR, i.e., thr = −(5 + (3 · SNR/4)) dB. Observing Fig. 3, we find that the success rate increases with the growth of SNR, which is a common property of the four competitors. For a fixed SNR, the proposed algorithm yields a higher success rate than the gradient algorithms.


Fig. 3. Success rate of the algorithms for extracting one single source. Five hundred Monte Carlo trials are performed for each SNR.


Fig. 4. Evolution of ISR versus the elapsed time for extracting one single source with SNR = 40 dB.

TABLE II. Success rate corresponding to Fig. 4.

This is especially evident in the high-SNR case, where our success rate even reaches 97%. In addition, we see that the whitening process practically improves the success rate of Liu's algorithm. One can also notice that the success rates listed in Table I are inferior to those shown in Fig. 3. This is not surprising because the AR coefficient α_i of each source is randomly generated in Fig. 2, which may produce some difficult cases in which two sources have very similar spectra because their AR coefficients happen to be close to each other. In addition, keep in mind that in the above simulations the gradient algorithms use the optimal predictor, which is impossible in practical applications.

Because our algorithm has a higher computational load per iteration than the gradient algorithms, we also investigate the convergence performance versus the elapsed time for extracting one source from large-scale mixtures. We synthetically generate N = 60 third-order AR sources with the model H(z) = 1/[(1 − p_1 z⁻¹)(1 − p_2 z⁻¹)(1 − p_3 z⁻¹)], where the poles p_1, p_2, p_3 of each source are randomly chosen from the range (−1, 1). Fig. 4 shows the averaged convergence behavior of the competitors versus the elapsed time over 500 independent trials, where the mixing matrix A and the initial value of the extraction vector w(0) in each trial are randomly generated. It is shown in Fig. 4 that, in spite of the higher computational load per iteration, the proposed algorithm is still the most efficient one for source extraction in the large-scale situation because it spends the least time to achieve the same performance. In addition, we see that the final performance of the gradient algorithms is much inferior to that of the proposed one. We conjecture that this results from two aspects: 1) the slow convergence of the gradient learning algorithms, which need more time to achieve the same performance, and 2) the nonoptimal predictors: unlike in Fig. 2, where we use the optimal predictors for the gradient learning algorithms, in Fig. 4 the predictors are randomly generated in each trial, because the optimal one is unavailable in practical situations. Table II lists the corresponding success rate of each competitor. One sees that the proposed quasi-Newton algorithm achieves the highest success rate. Therefore, the proposed algorithm greatly improves the convergence behavior and success rate of the CCA-based BSE method.
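The large-scale sources can be synthesized by drawing the poles and expanding them into AR filter coefficients (again a sketch; the seed and sample length are our choices):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)                 # seed is our choice
N, T = 60, 10000
sources = np.empty((N, T))
for i in range(N):
    poles = rng.uniform(-1.0, 1.0, size=3)     # p1, p2, p3 in (-1, 1)
    a = np.poly(poles)                         # AR coefficients of H(z)
    src = lfilter([1.0], a, rng.standard_normal(T))
    sources[i] = src / src.std()               # normalize to unit variance
```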

Fig. 5. Source signals: (s1, s3) are two colored Gaussian signals with different autocorrelations, and (s2, s4) are two biological signals.

B. Extraction of All Sources: Steady-State Performance

In this example, we test our algorithm for sequentially extracting all the sources using the deflation procedure of Section III-C. The source signals are the four benchmark signals shown in Fig. 5, taken from the file ABio7.mat provided by the ICALAB toolbox [29]. M = 4 mixtures are observed without noise, where the mixing matrix is randomly generated and given by

$$A = \begin{bmatrix} 0.547 & 0.276 & -1.071 & 0.615 \\ 0.911 & -0.327 & 0.512 & -1.158 \\ -0.273 & -0.418 & -0.571 & -0.595 \\ 0.757 & -1.623 & -2.361 & 0.408 \end{bmatrix}.$$

As the prewhitening procedure improves the success rate of Liu's algorithm, we here consider only Liu's algorithm with prewhitening. The step size is μ = 0.0002 for Liu's algorithm and μ = 0.002 for Cichocki's algorithm. The coefficients of a randomly generated linear predictor for Liu's algorithm are


given by

$$b = [-0.943, -0.955, 0.843, 0.269, -0.854, -0.169, 1.772, 1.355, -0.654, 0.425]. \tag{30}$$

Fig. 6. Sequentially extracted signals. (a) Liu's algorithm. (b) Cichocki's algorithm. (c) Proposed algorithm.

Fig. 7. Learning curve of each signal. (a) Learning of s4. (b) Learning of s2. (c) Learning of s3. (d) Learning of s1.

This predictor also serves as the initial value for Cichocki's algorithm for the extraction of the first source, and the current predictor is then used as the initial value for the next source. The other settings are the same as those in the previous example. Fig. 6 shows the sources sequentially extracted by the three competitors. We see that all three competitors successfully extract the sources, but in different extraction orders. For Liu's algorithm, the order depends on the MSPEs of the sources. Using the predictor in (30), the normalized MSPEs of the sources are (9.675, 10.226, 11.559, 0.028), respectively; therefore, the extraction order is s4, s1, s2, s3, as shown in Fig. 6(a). For the proposed algorithm, the order depends only on the normalized time-delayed autocorrelations of the source signals (see Section II), which are (0.0367, 0.8422, 0.6273, 0.9997); the order is then s4, s2, s3, s1.

Because Cichocki's algorithm uses a learning predictor, its extraction order cannot be determined in advance. We conjecture that the order is closely related to the initial value of the predictor for each signal.

Fig. 7 shows the corresponding learning curve of each source signal. Although our algorithm needs more time to complete the overall iterations, it is the most efficient one in achieving a specified performance. In addition, except for the extraction of the last signal, where Cichocki's algorithm prevails, our algorithm yields the best final accuracy for the other three sources. This merit is valuable because it mitigates the error accumulation in deflation-based blind source separation. Note that the ISRs corresponding to the last extracted signals of the three competitors remain constant. This is not surprising because the number of mixtures decreases as the source signals are extracted. For the last extracted signal, the extraction vector degenerates into a scalar and the corresponding mixing matrix becomes a row vector. In such a case, the global mixing-separating vector c only changes in its norm; thus the ISR is fixed because it is invariant with respect to the norm of c.


Fig. 8. Evolution of the cost functions for extracting the source signals (a) s4, (b) s2, (c) s3, and (d) s1.

Fig. 9. Convergence curve of the extraction vector w for the first extracted signal s4 .

Besides the performance comparison, we also investigate the corresponding cost function of the proposed algorithm for extracting each source, as well as the convergence behavior of the extraction vector. Fig. 8 shows the evolution of the cost function for each source signal, where the theoretical value is computed using the result in Proposition 1 and λ_i is the normalized time-delayed autocorrelation of the source signals given in (7). We see from Fig. 8 that the cost function associated with each source converges rapidly to its theoretical value; the convergence needs about 10³ iterations for each signal. To investigate the convergence property of the extraction vector w, we take the first extracted signal (i.e., s4) as an example. We measure the distance between w and ±v_1; a valid definition using projection matrices is the fit error

$$d(w, \pm v_1) = 10\log_{10}\left\|\left(w^T w\right)^{-1} w w^T - \left(v_1^T v_1\right)^{-1} v_1 v_1^T\right\|_F.$$

Fig. 9 shows the evolution of the fit error versus the iterations. We find that the extraction vector w converges rapidly to ±v_1 within about 2000 iterations. The above results validate our theoretical analysis.

Fig. 10. ECG signals taken by abdominal (the first five signals) and thoracic (the last three signals) measurements from a pregnant woman.

C. Extraction of ECG Data

In this example, we test our algorithm on real-world data: the well-known ECG data measured from a pregnant woman, distributed by De Moor and collected in ICALAB [29]. The data are shown in Fig. 10 and consist of 10 s of cutaneous potential recordings from M = 8 sensors sampled at 500 Hz. One can see the strong and slow heartbeat of the mother and the weak and fast one of the fetus. The task is to extract the desired ECG. On one hand, the fetal electrocardiogram (FECG) is so weak that it is almost submerged by the mother's ECG (e.g., in the first four recordings) and noise (e.g., in the 5th recording); hence, the FECG is one of the desired ECGs.


Fig. 11. Autocorrelation of the last signal in Fig. 10.

On the other hand, although some recordings are mainly devoted to the mother's electrocardiogram (MECG), they are contaminated by noise to some extent (e.g., the 5th recording). Therefore, a cleaner MECG without noise is the other desired ECG in this experiment.

To extract the desired ECG, a priori knowledge of the source of interest must be exploited. This actually corresponds to a semiblind extraction problem. According to [13], we may use the periodicity of the source of interest. Assume that the fundamental periods of the FECG and MECG are P_F and P_M samples, respectively, which can be determined from the autocorrelation of a certain recording. Fig. 11 shows the autocorrelation of the last recording, where the fetal influence is clearly stronger than in the other recordings. We find that the autocorrelation has two peaks, at 0.404 and 0.7 s, which correspond to P_F = 202 and P_M = 350 samples, respectively.

To exploit the estimated period of the source of interest, for the proposed algorithm we simply set τ = P_F to extract the FECG and τ = P_M to extract the MECG. The forgetting factor is set to β = 0.8. For the prediction-based gradient algorithms, instead of using multiple time delays, we use a single time delay τ = P_F for the FECG and τ = P_M for the MECG, which corresponds to using a predictor b = [0, ..., 0, 1] with τ − 1 zeros. The learning rate is μ = 0.002 for Liu's algorithm and μ = 0.01 for Cichocki's algorithm. Another important issue is the prewhitening of the recordings for the gradient algorithms, which requires estimating the number of sources N in advance; however, this is usually impossible in such an experiment. We here empirically choose N = 5 for the FECG and N = 2 for the MECG. Besides the above online learning algorithms, two batch learning algorithms (Barros's algorithm [13] and the SDA [11]) are also run as a reference.

Fig. 12 shows the FECG extracted by the various algorithms. One can see that, with the exception of Liu's algorithm, which fails to extract the FECG, all the others successfully extract it. The fetal heart rate is about 150 beats/min. Among these successful extractions, the two batch learning algorithms achieve the best performance because the interference from the MECG is well suppressed. The three online learning algorithms extract the FECG with more or less MECG residue. Compared with the two gradient learning algorithms, our proposed algorithm has less MECG residue. Although the batch learning algorithms have better extraction performance, they are unsuitable for some real-time applications because they cannot work until a whole sample sequence has been acquired.
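As an aside, the period estimates P_F = 202 and P_M = 350 used above can be reproduced programmatically from the autocorrelation of a recording. The sketch below is ours; the sampling-rate default and the minimum-lag guard that skips the zero-lag lobe are assumptions:

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_period(rec, fs=500, min_lag=100):
    """Return (lag in samples, period in seconds) of the dominant
    nonzero-lag autocorrelation peak of one recording."""
    rec = rec - rec.mean()
    ac = np.correlate(rec, rec, mode="full")[len(rec) - 1:]
    ac = ac / ac[0]                       # normalized autocorrelation
    peaks, _ = find_peaks(ac[min_lag:])   # skip the zero-lag main lobe
    best = peaks[np.argmax(ac[min_lag:][peaks])]
    return min_lag + best, (min_lag + best) / fs
```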

Fig. 12. Extracted FECG. (a) Barros’s algorithm [13]. (b) SDA algorithm [11]. (c) Liu’s algorithm [19]. (d) Liu’s algorithm with prewhitening [23]. (e) Cichocki’s algorithm [22]. (f) Proposed Newton algorithm.

Fig. 13. Extracted MECG. (a) Barros’s algorithm [13]. (b) SDA algorithm [11]. (c) Liu’s algorithm [19]. (d) Liu’s algorithm with prewhitening [23]. (e) Cichocki’s algorithm [22]. (f) Proposed Newton algorithm.

In addition, the SDA algorithm alternately learns three sets of parameters, and we have found in our experiment that it is very sensitive to the initial values of these parameters. The two gradient-based online learning algorithms are very sensitive to the whitening process: we find that they may fail if the number of sources N is set to other values (i.e., N ≠ 5) in the whitening process. In contrast, our proposed algorithm yields consistent extraction performance in this experiment.

Fig. 13 shows the extracted MECG signals. One sees that Liu's algorithm without prewhitening achieves the best performance; our algorithm also offers a clear MECG signal with good noise suppression, whereas the others, including the two batch learning algorithms, provide inferior MECG extraction performance. It should be pointed out that the SDA algorithm may fail for MECG extraction because it is still sensitive to the initial values of its parameters. For the two online learning


algorithms with prewhitening, the extraction performance actually depends mainly on the principal component analysis, because we set N = 2 in the whitening process for MECG extraction. We further find that if N > 2 is used for whitening, the two online learning algorithms fail to extract the MECG.

VI. CONCLUSION

In this paper, we have revisited the CCA approach for adaptive blind extraction of mutually uncorrelated but temporally colored signals. Although some online realizations of this approach have been proposed using linear prediction, the resulting algorithms are gradient search methods and suffer from slow convergence. In addition, the convergence of the gradient learning algorithms is unresolved, which often leads to failed extraction. To overcome these drawbacks, we introduced a novel criterion for BSE. The minima of this criterion were analyzed, which proves the equivalence between the CCA criterion and the proposed one. Through approximating the Hessian matrix, a fast and efficient quasi-Newton algorithm was derived. Using Lyapunov's method, we showed that the proposed algorithm asymptotically converges to the global minimum of the criterion. Simulation results were presented to verify our analysis and show the fast convergence and high success rate of the proposed algorithm. Although the proposed algorithm has a higher complexity of O(M²), this is still tolerable because it converges much faster than the gradient algorithms. In addition, we have found that, because of the error accumulation in the deflation technique and their unstable convergence, it is very hard for the gradient algorithms to successfully extract multiple sources. Therefore, the proposed algorithm is more suitable for multiple source extraction because of its fast and stable convergence.

Several points remain interesting topics. A convergence analysis of the proposed algorithm for small sample sizes can be anticipated. Why do the gradient learning algorithms converge to their global optimal solutions with such a low success rate? A rigorous convergence analysis may answer this question. Could the approximation of the Hessian in the proposed algorithm be of interest for other learning problems? The influence of this approximation should be theoretically investigated.

APPENDIX A
PROOF OF PROPOSITION 1

Let g(w) = ∇_w J be the gradient of J(w) with respect to w. It is easy to show

$$g(w) = 2\left[R_x^0 w - \left(w^T R_x^\tau w\right)^{-1} R_x^\tau w\right]. \tag{31}$$

If w = ±v_i, i ∈ I_M, using (4) and (5) we get

$$g(w)\big|_{w = \pm v_i} = \pm 2\left(R_x^0 v_i - \lambda_i^{-1} R_x^\tau v_i\right) = 0. \tag{32}$$

This indicates that w = ±v_i is a stationary point of J(w). Conversely, g(w) = 0 implies

$$R_x^\tau w = \tilde{\lambda} R_x^0 w \tag{33}$$

where the scalar λ̃ is defined as

$$\tilde{\lambda} = w^T R_x^\tau w. \tag{34}$$

Premultiplying both sides of (33) by w^T yields

$$w^T R_x^0 w = 1. \tag{35}$$

Comparing (4)–(6) with (33)–(35), respectively, we conclude that the stationary points of J(w) correspond to the generalized eigenvectors of the matrix pencil (R_x^τ, R_x^0). At each stationary point, it is easy to show

$$J(w)\big|_{w = \pm v_i} = 1 - \log\lambda_i, \quad i \in I_M. \tag{36}$$

APPENDIX B
PROOF OF PROPOSITION 2

Let H(w) = ∇_w ∇_w^T J be the Hessian matrix of J(w) with respect to w. It is easy to show

$$H(w) = 2\left[R_x^0 + 2\left(w^T R_x^\tau w\right)^{-2} R_x^\tau w\, w^T \left(R_x^\tau\right)^T - \left(w^T R_x^\tau w\right)^{-1} R_x^\tau\right]. \tag{37}$$

An evaluation of H(w) at the stationary point w = ±v_i, i ∈ I_M leads to

$$H(w)\big|_{w = \pm v_i} = 2\left[R_x^0 + 2\lambda_i^{-2} R_x^\tau v_i v_i^T \left(R_x^\tau\right)^T - \lambda_i^{-1} R_x^\tau\right]. \tag{38}$$

Let V = [v_1, ..., v_M] and D = diag[λ_1, ..., λ_M] be the generalized eigenvector matrix and eigenvalue matrix, respectively; the generalized eigendecomposition in (4)–(6) can also be written in matrix form

$$V^T R_x^\tau V = D, \quad V^T R_x^0 V = I \tag{39}$$

with V nonsingular [24]. Substituting R_x^τ = V^{-T} D V^{-1} and R_x^0 = V^{-T} V^{-1} into (38), we obtain

$$\begin{aligned} H(w)\big|_{w = \pm v_i} &= 2\left[V^{-T} V^{-1} + V^{-T}\left(2\lambda_i^{-2} D e_i e_i^T D^T\right) V^{-1} - V^{-T}\left(\lambda_i^{-1} D\right) V^{-1}\right] \\ &= 2 V^{-T}\left[I + 2\lambda_i^{-2} D^2 e_i e_i^T - \lambda_i^{-1} D\right] V^{-1} \\ &= 2 V^{-T}\left[I + 2 e_i e_i^T - \lambda_i^{-1} D\right] V^{-1} \\ &= 2 V^{-T} \operatorname{diag}\left[1 - \frac{\lambda_1}{\lambda_i}, \ldots, 3 - \frac{\lambda_i}{\lambda_i}, \ldots, 1 - \frac{\lambda_M}{\lambda_i}\right] V^{-1} \end{aligned} \tag{40}$$

where e_i is the i-th column of I. Equation (40) actually shows the eigendecomposition of the Hessian matrix evaluated at the stationary point. Clearly, H(w) is positive definite iff w = ±v_1; otherwise H(w) is indefinite. Therefore, w = ±v_i, i = 2, ..., M are saddle points, and w = ±v_1 is the unique local minimum. In addition, note that J(w) is unbounded when w approaches zero or infinity; thus w = ±v_1 is also the global minimum, and the cost function attains 1 − log λ_1 at this point.

APPENDIX C
PROOF OF PROPOSITION 3

Mimicking the convergence analysis of principal component algorithms [27], the proof can be done by finding an appropriate Lyapunov function, which decreases along any trajectory of the ODE (22). Observing (23), the outline of this proof is to show


that the coefficient α_1(t) tends to ±1 and the others tend to zero. According to Lemma 1, starting inside the domain of attraction, we have

$$\alpha_1(t) = v_1^T R_x^0 w(t) \neq 0 \tag{41}$$

for all t. Then, we can define η_i = α_i/α_1, i = 2, ..., M. The time derivative of η_i is given by

$$\dot{\eta}_i(t) = \alpha_1^{-2}(t)\left[\dot{\alpha}_i(t)\alpha_1(t) - \alpha_i(t)\dot{\alpha}_1(t)\right]. \tag{42}$$

Substituting (27) into (42) leads to

$$\dot{\eta}_i(t) = \left[\theta_i(t) - \theta_1(t)\right]\frac{\alpha_i(t)}{\alpha_1(t)} = \left[\theta_i(t) - \theta_1(t)\right]\eta_i(t) \tag{43}$$

where θ_i is defined in (28). Let us define a simple scalar function V[η_i(t)] = η_i²(t) for the dynamic system (43). It can be readily shown that V[η_i(t)] ≥ 0, and the time derivative of V is given by

$$\dot{V}[\eta_i(t)] = 2\eta_i^2(t)\left[\theta_i(t) - \theta_1(t)\right]. \tag{44}$$

It can be shown that

$$\theta_i(t) - \theta_1(t) = \frac{3\tilde{\lambda}(t)(\lambda_i - \lambda_1)}{\tilde{\lambda}^2(t) + 2\sum_{m=1}^{M}\lambda_m^2\alpha_m^2(t)}. \tag{45}$$

Because λ_i < λ_1 for i = 2, ..., M, we have θ_i(t) − θ_1(t) < 0, and hence V̇[η_i(t)] ≤ 0 with equality iff η_i(t) = 0, so lim_{t→∞} η_i(t) = 0 for i = 2, ..., M. Rigorously speaking, we cannot immediately conclude that lim_{t→∞} α_i(t) = 0, i = 2, ..., M, until α_1(t) is proved to remain finite. According to (41), this is equivalent to investigating the boundedness of w; more details are given in Appendix D. Thus, w(t) = α_1(t)v_1 is the asymptotically stable equilibrium point of (22). For large enough t, θ_1(t) can be simplified as

$$\theta_1(t) = \frac{3\lambda_1^2\alpha_1^2(t)}{\lambda_1^2\alpha_1^4(t) + 2\lambda_1^2\alpha_1^2(t)} - 1 = \frac{1 - \alpha_1^2(t)}{2 + \alpha_1^2(t)}. \tag{46}$$

Then, the time derivative of α_1(t) reads

$$\dot{\alpha}_1(t) = \left[2 + \alpha_1^2(t)\right]^{-1}\left[1 - \alpha_1^2(t)\right]\alpha_1(t). \tag{47}$$

Now, let us define a variable z(t) = α_1²(t) − 1 and a scalar function V[z(t)] = z²(t); the time derivatives of z(t) and V[z(t)] are given by

$$\dot{z}(t) = -2\left[3 + z(t)\right]^{-1} z(t)\left[z(t) + 1\right] \tag{48}$$

$$\dot{V}[z(t)] = -4\left[3 + z(t)\right]^{-1} z^2(t)\left[z(t) + 1\right]. \tag{49}$$

Clearly, V̇[z(t)] ≤ 0 with equality iff z(t) = 0, i.e., α_1²(t) = 1. Therefore, V[z(t)] is a Lyapunov function for (48), and α_1(t) = ±1 is the asymptotically stable equilibrium point. This completes the proof of Proposition 3.

APPENDIX D
BOUNDEDNESS OF w

In this Appendix, we discuss the boundedness of the extraction vector w under some mild assumptions. It is interesting to note that the proposed quasi-Newton algorithm (14) can be formulated as a power-like method

$$w(t) = \kappa(t)\,\Phi\, w(t-1) + n(t) \tag{50}$$

where n(t) denotes the finite-precision error vector due to numerical operations, and the time-variant coefficient κ(t) and the matrix Φ are given by

$$\kappa(t) = \frac{3\tilde{\lambda}(t)}{\tilde{\lambda}^2(t) + 2\, w^T(t-1)\, R_x^\tau \left(R_x^0\right)^{-1} R_x^\tau w(t-1)} \tag{51}$$

$$\Phi = \left(R_x^0\right)^{-1} R_x^\tau. \tag{52}$$

Clearly, the boundedness of w is in accordance with the stability of the linear system (50). On one hand, if some of the eigenvalues of the system matrix κ(t)Φ are always greater than 1, then the linear system (50) is numerically unstable, and w will become infinite when t is very large. On the other hand, if all the eigenvalues of κ(t)Φ are always less than 1, then the linear system (50) is numerically stable. However, this is also not the desired case, because w would then vanish as the iterations continue. In fact, neither is the case for the proposed algorithm. It is straightforward that Φ has eigenvalues equal to the generalized eigenvalues λ_i, i = 1, ..., M of the matrix pencil (R_x^τ, R_x^0). For κ(t), dropping the time index and substituting (26) and (39) into (51), we have

$$\kappa = \frac{3}{w^T R_x^\tau w + 2\dfrac{w^T V^{-T} D^2 V^{-1} w}{w^T V^{-T} D V^{-1} w}}. \tag{53}$$

Using the boundedness of the Rayleigh quotient, one has

$$\frac{3}{\lambda_1\left(w^T R_x^0 w + 2\right)} \leq \kappa \leq \frac{3}{\lambda_M\left(w^T R_x^0 w + 2\right)}. \tag{54}$$

We here focus only on the maximal and minimal eigenvalues of the system matrix κ(t)Φ. Using (54), we have

$$\frac{3}{w^T R_x^0 w + 2} \leq \kappa\lambda_1 \leq \frac{3\lambda_1}{\lambda_M\left(w^T R_x^0 w + 2\right)} \tag{55}$$

$$\frac{3\lambda_M}{\lambda_1\left(w^T R_x^0 w + 2\right)} \leq \kappa\lambda_M \leq \frac{3}{w^T R_x^0 w + 2}. \tag{56}$$

Recalling (7) and Assumption 2, we have 0 < λ_i ≤ 1, i ∈ I_M. Because the time lag τ is chosen close to zero lag, for most temporally colored source signals the λ_i, i ∈ I_M are all close to 1, i.e., λ_M is close to λ_1. Then (55) and (56) imply

$$\kappa\lambda_1 \approx \frac{3}{w^T R_x^0 w + 2} \tag{57}$$

$$\kappa\lambda_M \approx \frac{3}{w^T R_x^0 w + 2}. \tag{58}$$

Suppose that w is close to the optimal solution; then w^T R_x^0 w ≈ 1. This shows that the maximal and minimal eigenvalues of the system matrix κ(t)Φ are close to 1, i.e., all the eigenvalues of the system matrix κ(t)Φ are approximately 1. Therefore, for the linear system (50), w neither becomes infinite nor vanishes as the iterations continue. Although we present our conclusion


using some approximations and assumptions, the conclusion is in agreement with our simulation results.

REFERENCES

[1] L. Tong, R. Liu, V. C. Soon, and Y.-F. Huang, “Indeterminacy and identifiability of blind identification,” IEEE Trans. Circuits Syst., vol. 38, no. 5, pp. 499–509, May 1991.
[2] A. Belouchrani, K. A. Meraim, J.-F. Cardoso, and E. Moulines, “A blind source separation technique using second-order statistics,” IEEE Trans. Signal Process., vol. 45, no. 2, pp. 434–444, Feb. 1997.
[3] A. M. Tome, “The generalized eigendecomposition approach to the blind source separation problem,” Digit. Signal Process., vol. 16, no. 3, pp. 288–302, 2006.
[4] A. Belouchrani and A. Cichocki, “Robust whitening procedure in blind source separation context,” Electron. Lett., vol. 36, no. 24, pp. 2050–2051, Nov. 2000.
[5] A. Hyvarinen, “Fast and robust fixed-point algorithms for independent component analysis,” IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 626–634, May 1999.
[6] W.-T. Zhang and S.-T. Lou, “Iterative algorithm for joint zero diagonalization with application in blind source separation,” IEEE Trans. Neural Netw., vol. 22, no. 7, pp. 1107–1118, Jul. 2011.
[7] Z. Yang, Y. Xiang, S. Xie, S. Ding, and Y. Rong, “Nonnegative blind source separation by sparse component analysis based on determinant measure,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1601–1610, Oct. 2012.
[8] S. Xie, L. Liu, J.-M. Yang, G. Zhou, and Y. Xiang, “Time-frequency approach to underdetermined blind source separation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 2, pp. 306–316, Feb. 2012.
[9] A. Cichocki, T. Rutkowski, A. K. Barros, and S.-H. Oh, “A blind extraction of temporally correlated but statistically dependent acoustic signals,” in Proc. IEEE Workshop Neural Netw. Signal Process., vol. 1, Dec. 2000, pp. 455–464.
[10] Y. Li and J. Wang, “Sequential blind extraction of instantaneously mixed sources,” IEEE Trans. Signal Process., vol. 50, no. 5, pp. 997–1006, May 2002.
[11] X.-L. Li and X.-D. Zhang, “Sequential blind extraction adopting second-order statistics,” IEEE Signal Process. Lett., vol. 14, no. 1, pp. 58–61, Jan. 2007.
[12] R. Vollgraf, I. Schießl, and K. Obermayer, “Blind source separation of single components from linear mixtures,” in Proc. Int. Conf. Artif. Neural Netw., 2001, pp. 509–514.
[13] A. K. Barros and A. Cichocki, “Extraction of specific signals with temporal structure,” Neural Comput., vol. 13, no. 9, pp. 1995–2003, 2001.
[14] Z.-L. Zhang and Z. Yi, “Robust extraction of specific signals with temporal structure,” Neurocomputing, vol. 69, nos. 7–9, pp. 888–893, 2006.
[15] Z. Malouche and O. Macchi, “Adaptive unsupervised extraction of one component of a linear mixture with a single neuron,” IEEE Trans. Neural Netw., vol. 9, no. 1, pp. 123–138, Jan. 1998.
[16] M. Borga and H. Knutsson, “A canonical correlation approach to blind source separation,” Dept. Biomed. Eng., Linkoping Univ., Linkoping, Sweden, Tech. Rep. LiU-IMT-EX-0062, Jun. 2001.
[17] O. Friman, M. Borga, P. Lundberg, and H. Knutsson, “Exploratory fMRI analysis by autocorrelation maximization,” NeuroImage, vol. 16, no. 2, pp. 454–464, 2002.
[18] Y.-O. Li, T. Adali, W. Wang, and V. D. Calhoun, “Joint blind source separation by multiset canonical correlation analysis,” IEEE Trans. Signal Process., vol. 57, no. 10, pp. 3918–3929, Oct. 2009.
[19] W. Liu, D. P. Mandic, and A. Cichocki, “Analysis and online realization of the CCA approach for blind source separation,” IEEE Trans. Neural Netw., vol. 18, no. 5, pp. 1505–1510, Sep. 2007.
[20] W. Liu, D. P. Mandic, and A. Cichocki, “A class of novel blind source extraction algorithms based on a linear predictor,” in Proc. IEEE Int. Symp. Circuits Syst., Kobe, Japan, May 2005, pp. 3599–3602.
[21] W. Liu, D. P. Mandic, and A. Cichocki, “Blind second-order source extraction of instantaneous noisy mixtures,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 9, pp. 931–935, Sep. 2006.
[22] A. Cichocki and R. Thawonmas, “On-line algorithm for blind signal extraction of arbitrarily distributed, but temporally correlated sources using second order statistics,” Neural Process. Lett., vol. 12, no. 1, pp. 91–98, 2000.
[23] W. Liu, D. P. Mandic, and A. Cichocki, “Blind source extraction based on a linear predictor,” IET Signal Process., vol. 1, no. 1, pp. 29–34, 2007.
[24] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD, USA: The Johns Hopkins Univ. Press, 1996.
[25] G.-O. Glentis, K. Berberidis, and S. Theodoridis, “Efficient least squares adaptive algorithms for FIR transversal filtering,” IEEE Signal Process. Mag., vol. 16, no. 4, pp. 13–41, Jul. 1999.
[26] R. Fletcher, Practical Methods of Optimization, 2nd ed. New York, NY, USA: Wiley, 2000.
[27] M. D. Plumbley, “Lyapunov functions for convergence of principal component algorithms,” Neural Netw., vol. 8, no. 1, pp. 11–23, 1995.
[28] L. Ljung and T. Soderstrom, Theory and Practice of Recursive Identification. Cambridge, MA, USA: MIT Press, 1983.
[29] A. Cichocki, S. Amari, K. Siwek, T. Tanaka, A. H. Phan, and R. Zdunek. (2007, Mar. 28). ICALAB—MATLAB Toolbox Ver. 3 for Signal Processing [Online]. Available: http://www.bsp.brain.riken.jp/ICALAB/

Wei-Tao Zhang received the Ph.D. degree in control science and engineering from Xidian University, Xi'an, China, in 2011. He is currently a Lecturer with the School of Electronic Engineering, Xidian University. His current research interests include blind signal processing.

Shun-Tian Lou was born in Zhejiang, China, in 1962. He received the B.Sc. degree in automatic control and the M.Sc. degree in electronic engineering from Xidian University, Xi'an, China, in 1985 and 1988, respectively, and the Ph.D. degree in navigation, guidance, and control from Northwestern Polytechnical University, Xi'an, in 1999. He was a Post-Doctoral Fellow with the Institute of Electronic Engineering, Xidian University, from 1999 to 2002. He is currently a Professor with the School of Electronic Engineering, Xidian University. His current research interests include signal processing, pattern recognition, and intelligent control using neural networks and/or fuzzy systems.

Da-Zheng Feng (M'02) was born in 1959. He received a degree from Xi'an University of Technology, Xi'an, China, in 1982, the M.S. degree from Xi'an Jiaotong University, Xi'an, in 1986, and the Ph.D. degree in electronic engineering from Xidian University, Xi'an, in 1996. He was a Post-Doctoral Research Affiliate and an Associate Professor with Xi'an Jiaotong University from May 1996 to May 1998. From May 1998 to June 2000, he was an Associate Professor with Xidian University, where he has been a Professor since July 2000. He has published over 80 journal papers. His current research interests include signal processing, brain information processing, image processing, radar techniques, and blind equalization.
