LETTER

Communicated by Kenneth Kreutz-Delgado

A Fast Algorithm for Learning Overcomplete Dictionary for Sparse Representation Based on Proximal Operators

Zhenni Li
[email protected]
Graduate School of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu-shi, 965-8580, Japan

Shuxue Ding
[email protected]
Faculty of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu-shi, 965-8580, Japan

Yujie Li
[email protected]
Graduate School of Computer Science and Engineering, University of Aizu, Aizu-Wakamatsu-shi, 965-8580, Japan

We present a fast, efficient algorithm for learning an overcomplete dictionary for sparse representation of signals. The whole problem is considered as a minimization of the approximation error function with a coherence penalty for the dictionary atoms and with the sparsity regularization of the coefficient matrix. Because the problem is nonconvex and nonsmooth, this minimization problem cannot be solved efficiently by an ordinary optimization method. We propose a decomposition scheme and an alternating optimization that can turn the problem into a set of minimizations of piecewise quadratic and univariate subproblems, each of which is a single variable vector problem, of either one dictionary atom or one coefficient vector. Although the subproblems are still nonsmooth, remarkably they become much simpler so that we can find a closed-form solution by introducing a proximal operator. This leads to an efficient algorithm for sparse representation. To our knowledge, applying the proximal operator to the problem with an incoherence term and obtaining the optimal dictionary atoms in closed form with a proximal operator technique have not previously been studied. The main advantages of the proposed algorithm are that, as suggested by our analysis and simulation study, it has lower computational complexity and a higher convergence rate than state-of-the-art algorithms. In addition, for real applications, it shows good performance and significant reductions in computational time.

Neural Computation 27, 1951–1982 (2015) doi:10.1162/NECO_a_00763

© 2015 Massachusetts Institute of Technology


1 Introduction

Situated at the heart of signal processing, data models are fundamental for stabilizing the solutions of inverse problems. The matrix factorization model has become a prominent technique for various tasks, such as independent component analysis (Hyvarinen, 1999), nonnegative matrix factorization (Lee & Seung, 1999; Cichocki & Phan, 2009), compressive sensing (Baraniuk, 2007), and sparse representation (Elad, 2010; Lewicki & Sejnowski, 2000; Kreutz-Delgado, Murray, & Rao, 2003). For the matrix factorization model, we collect the measurements and stack them as the observation matrix Y := {y_i}_{i=1}^{L} ∈ R^{m×L}, where m is the dimensionality of each sample y_i ∈ R^m. Matrix factorization consists of finding a factorization of the form Y ≈ WH by minimizing ‖Y − WH‖_F², where the factor W ∈ R^{m×r} is called the dictionary (its column vectors are called atoms), the other factor, H ∈ R^{r×L}, is called the coefficient matrix, and ‖Y − WH‖_F² is the approximation error. The factors from different tasks are shown to have very different representational properties. The differences mainly come from two aspects: the constraints imposed, such as sparseness, nonnegativity, and smoothness, and the relative dimensionality of W, which is assumed a priori. The relative dimensionality of matrix factorization has two categories: overdetermined (m > r) systems (Lee & Seung, 1999; Cichocki & Phan, 2009) and underdetermined (m < r) systems (Baraniuk, 2007; Elad, 2010; Lewicki & Sejnowski, 2000; Kreutz-Delgado et al., 2003) of linear equations, which have completely different application backgrounds.

We investigate the underdetermined linear inverse problem. The straightforward optimization of underdetermined linear inverse problems, with infinitely many solutions, is an ill-posed problem. The problem can be solved by imposing a sparsity constraint on H, where the sparsity is associated with the overcompleteness (m < r) of W, so that most coefficients in H become zero. Then each signal is represented as a linear combination of a few dictionary atoms, leading to the issue of sparse representation, where the dictionary is overcomplete. Sparse representation is of significant importance in signal processing (Elad, Figueiredo, & Ma, 2010; Jafari & Plumbley, 2011; Fadili, Starck, & Murtagh, 2009; Adler et al., 2012; Caiafa & Cichocki, 2012).

A key problem related to sparse representation is the choice of the dictionary on which the signals of interest are decomposed. One simple approach is to consider predefined dictionaries, such as the discrete Fourier transform, the discrete cosine transform, and wavelets (Selesnick, Baraniuk, & Kingsbury, 2005). Another approach is to use an adaptive dictionary learned from the signals that are to be represented, which results in better matching to the contents of the signals. The learned dictionary has the potential to offer improved performance compared with that obtained using predefined dictionaries.
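To fix notation before the algorithmic discussion, the toy Python sketch below builds an overcomplete dictionary W (m < r) with unit-norm atoms, draws sparse coefficients H, and evaluates the approximation error ‖Y − WH‖_F². The sizes mirror the synthetic experiment of section 4.1; everything else here is an illustrative choice of ours, not part of any algorithm in the letter.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, L, s = 20, 50, 1500, 3            # m < r: overcomplete dictionary

# Random dictionary with unit-norm atoms (columns).
W = rng.normal(size=(m, r))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# Each signal uses only s atoms: H has s nonzero entries per column.
H = np.zeros((r, L))
for i in range(L):
    support = rng.choice(r, size=s, replace=False)
    H[support, i] = rng.normal(size=s)

Y = W @ H                                # observation matrix Y ~ WH
# For estimated factors (W_hat, H_hat), this is the approximation error.
approx_error = np.linalg.norm(Y - W @ H, 'fro') ** 2   # zero for the true factors
```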


Many approaches (Engan, Aase, & Husoy, 1999; Aharon, Elad, & Bruckstein, 2006; Dai, Xu, & Wang, 2012; Yaghoobi, Daudet, & Davies, 2009; Bao, Ji, Quan, & Shen, 2014) have been proposed. Most of them use two alternating stages: sparse coding and dictionary update. The sparse coding step finds the best, sparsest coefficient matrix of the training signals. For the sparse coding, the ℓ0-norm is treated as a sparsity constraint. A typical approach is orthogonal matching pursuit (OMP) (Tropp, 2004). OMP, a greedy algorithm, can find sufficiently good representations. However, it may not provide good enough estimations of signals, and its computational complexity is very high. Another appealing method for sparse coding is basis pursuit (BP) (Chen, Donoho, & Saunders, 2001) or LASSO (Tibshirani, 1994), which replaces the ℓ0-norm with an ℓ1-norm. Gradient-based methods (Yaghoobi, Daudet, & Davies, 2009; Rakotomamonjy, 2013) are very effective approaches to solving ℓ1-norm optimizations. The gradient method can lead to an iterative soft-thresholding method, which requires more iterations, especially if the solution is not very sparse or the initialization is not ideal.

For the dictionary update, the atoms are updated either one by one or simultaneously. By optimizing a least squares problem, the whole set of atoms is updated at once in Engan, Aase, et al. (1999), but this is not guaranteed to converge. One-by-one atom updating is implemented in Aharon et al. (2006) through singular value decomposition (SVD); this approach can learn a better dictionary. However, SVD is computationally expensive. The approach in Bao, Ji, Quan, and Shen (2014) updates the dictionary atoms one by one based on a proximal operation method, which is essentially a gradient-based method and converges slowly. Therefore, it is important to develop an efficient strategy to accelerate the dictionary learning process.

In this letter, we address the problem of learning an overcomplete dictionary for sparse representation. According to the theory of compressive sensing, the mutual incoherence of the dictionary plays a crucial role in sparse coding (Sigg, Dikk, & Buhmann, 2012; Lin, Liu, & Zha, 2012; Mailhe, Barchiesi, & Plumbley, 2012; Bao, Ji, & Quan, 2014; Wang, Cai, Shi, & Yin, 2014). If the dissimilarities between any two atoms in the dictionary are high, the sparse representation generated from this dictionary is more efficient. Therefore, we impose the incoherence constraint on the dictionary. For simplicity, we use the ℓ1-norm for sparsity. Hence, the whole problem is constructed as a minimization of the approximation error with the sparsity regularization on the coefficient matrix and with the coherence penalty on the dictionary atoms (see section 3.1).

From the viewpoint of optimization, the whole problem is nonconvex with respect to the dictionary and the coefficient matrix, and nonsmooth because of the sparsity regularization and the coherence penalty; the problem therefore cannot be solved efficiently by using an ordinary optimization method. To address this problem, we have separated the problem into a series of subproblems, each of which is a minimization over a single vector variable (i.e., a univariate function), which can be solved more easily. Remarkably, each single-vector-variable function


has a piecewise quadratic form. Thus, we can apply the proximal operator (PO) (Moreau, 1962) to handle the nonsmoothness of each subproblem and thereby obtain a closed-form solution of each subproblem. This is the case for the dictionary atoms, even though the coherence penalty term has been included, as well as for the sparse coefficient vectors. To our knowledge, applying PO to the problem with a coherence penalty term has not appeared in the literature. An appealing feature of this method is that the subproblems for atom updating and coefficient updating have similar forms. Both are piecewise quadratic and thus can be solved efficiently by the PO technique. Interestingly, these lead to explicit closed-form solutions. In this way, the whole problem can be solved efficiently and quickly, and we have developed an algorithm, the fast proximal dictionary learning algorithm (fastPDL). This algorithm does not include optimization updating steps, though it is still a recursive procedure because it treats the atoms and coefficient vectors one by one in each round, and the results from a round can affect the results in the next round recursively. The proposed algorithm gains from directly obtaining the closed-form solutions with low complexity and a high convergence rate, avoiding costly techniques of sparse coding, such as OMP (Tropp, 2004) and BP (Chen et al., 2001), and avoiding the gradient-based methods (Rakotomamonjy, 2013; Bao, Ji, Quan, & Shen, 2014) with their slow convergence rates. From the theoretical analysis, the fastPDL algorithm has efficiency far beyond state-of-the-art algorithms. The proposed algorithm is expected to have the following desirable characteristics:

• Our dictionary learning problem is formulated as a minimization of the approximation error with the sparsity regularization of the coefficient matrix and the coherence penalty of the dictionary. The coherence of W and the sparsity of H can be flexibly controlled by adjusting the corresponding regularization parameters.

• We turn the whole problem into a set of univariate optimization subproblems. Each is piecewise quadratic so that we can apply PO to solve it and give a closed-form solution explicitly. This leads to an algorithm with low computational complexity and a high convergence rate.

• While the PO method has been used to solve sparse coding or compressive sensing problems with an ℓ1-norm, to our knowledge, no such treatment exists for dictionary learning. In this letter, because we use a coherence penalty term, the dictionary learning subproblem is also nonsmooth, so that ordinary optimization methods become inefficient. We show that the PO technique can also be applied to this problem and can give a closed-form solution. Thus, while the coherence penalty term of the dictionary is incorporated into the dictionary


learning problem as an additional constraint, it does not increase the complexity of the fastPDL algorithm.

The remainder of the letter is organized as follows. In section 2, we introduce the dictionary learning problem, including sparse coding and dictionary update, and provide an overview of the state-of-the-art algorithms. In section 3, we describe and analyze the fastPDL algorithm in detail. Numerical experimental studies described in section 4 clearly establish the practical advantages of the proposed fastPDL algorithm. We show that this algorithm converges significantly faster than state-of-the-art algorithms. We also present two applications, one involving nonnegative signals and the other involving a general real-valued signal. The letter concludes in section 5.

1.1 Notation. A boldface uppercase letter, such as X, denotes a matrix, and a lowercase letter x_ij denotes the ijth entry of X. A boldface lowercase letter x denotes a vector, and a lowercase letter x_j denotes the jth entry of x. X_{i:} denotes the ith row, and X_{:j} denotes the jth column of matrix X. X^T denotes the transpose of the matrix X. The Frobenius norm of X is defined as ‖X‖_F = (Σ_{i,j} |x_ij|²)^{1/2}. The ℓp-norm of X is defined as ‖X‖_p = (Σ_{i,j} |x_ij|^p)^{1/p}. The ℓ1-norms of X and x are defined as ‖X‖_1 = Σ_{i,j} |x_ij| and ‖x‖_1 = Σ_i |x_i|, respectively. ‖·‖_0 is the so-called ℓ0-norm, which counts the number of nonzero elements. Tr(·) denotes the trace of a square matrix. In this letter, all the parameters take real values, even though we do not state this explicitly each time.

2 Overview of Dictionary Learning Problem

Dictionary learning, as a matrix factorization, is a procedure for factorizing the signal data Y ∈ R^{m×L} as the product of W ∈ R^{m×r} and H ∈ R^{r×L}. Generally the problem is solved by the following minimization:

\min_{H,W} \; \phi(H, W) = \|Y - WH\|_F^2.   (2.1)

However, if m is less than r, the case that we are interested in, the problem becomes ill posed. Therefore, some constraints, such as sparsity, should be imposed on H. Usually the sparsity can be measured by, for instance, the ℓ0-norm or ℓ1-norm. It is worth noting that the error function, equation 2.1, is not convex with respect to H and W. Most dictionary learning algorithms attack this problem by iteratively performing a two-stage procedure: sparse coding and dictionary update. Starting with an initial dictionary, the following two stages are repeated until convergence.
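As a rough illustration of this two-stage scheme, and not of the specific updates of any one algorithm, the sketch below alternates a generic sparse coding routine and a generic dictionary update. The callbacks `sparse_coding` and `dictionary_update` are hypothetical placeholders for the concrete rules reviewed in sections 2.1 and 2.2; the stopping rule mirrors the relative-change criterion used later in section 4.

```python
import numpy as np

def learn_dictionary(Y, r, sparse_coding, dictionary_update,
                     n_iter=100, tol=1e-7):
    """Generic two-stage dictionary learning loop (sketch).

    Y : (m, L) signal matrix; r : number of atoms (r > m for an
    overcomplete dictionary). The two callbacks stand in for any of the
    sparse-coding / dictionary-update rules reviewed in section 2.
    """
    m, L = Y.shape
    # Initialize W with r randomly chosen, normalized training signals.
    W = Y[:, np.random.choice(L, r, replace=False)].astype(float)
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    H = np.zeros((r, L))
    prev_obj = np.linalg.norm(Y - W @ H, 'fro') ** 2
    for _ in range(n_iter):
        H = sparse_coding(Y, W, H)        # stage 1: update coefficients
        W = dictionary_update(Y, W, H)    # stage 2: update atoms
        obj = np.linalg.norm(Y - W @ H, 'fro') ** 2
        if abs(prev_obj - obj) / max(prev_obj, 1e-12) <= tol:
            break                         # relative-change stopping rule
        prev_obj = obj
    return W, H
```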


2.1 Sparse Coding. Sparse coding is the process of computing the coefficient matrix H based on the set of signals Y and a known dictionary W by solving the following problem,

\min_{H} \sum_{i=1}^{L} \|h_i\|_0 \quad \text{subject to} \quad \|Y - WH\|_F^2 \le \varepsilon,   (2.2)

where ε is a real value. Exact determination of the problem has been proven to be NP-hard. Thus, approximate solutions are considered instead, and several efficient algorithms have been proposed. The simplest ones are matching pursuit (MP) (Mallat & Zhang, 1993) and OMP, which are greedy algorithms. However, these methods may not provide a good enough estimation of signals and are not suitable for high-dimensional problems. The relaxation method uses the ℓ1-norm instead of the ℓ0-norm and is suitable for large-scale optimization problems. The most successful approach is BP. The problem is given as

\min_{H} \sum_{i=1}^{L} \|h_i\|_1 \quad \text{subject to} \quad \|Y - WH\|_F^2 \le \varepsilon.   (2.3)

Solutions can be found by linear programming methods. For large-scale problems with thousands of signals, this is tractable but very slow. In recent years, several authors have proposed improvements for BP to speed up the algorithm based on the following so-called ℓp-regularized problem:

\min_{H} \; \phi(H) = \|Y - WH\|_F^2 + \lambda \|H\|_p^p,   (2.4)

where ‖H‖_p^p is a sparsity-inducing regularizer on H and λ > 0 is a regularization parameter. Thus, the solution to sparse approximation becomes an ℓp-regularized optimization problem, where p ∈ [0, 1]. When p = 1, it is an ℓ1-regularized problem that has increasing popularity because the problem is a convex quadratic optimization that can be efficiently solved using, for example, the iterative shrinkage-thresholding algorithm (ISTA) (Daubechies, Defrise, & De Mol, 2004; Beck & Teboulle, 2009), iteratively reweighted least squares (IRLS) (Chartrand & Yin, 2008), LARS (Efron, Hastie, Johnstone, & Tibshirani, 2004), and the work of Lee, Battle, Raina, and Ng (2007).

2.2 Dictionary Update. Once the sparse coding task is done, the second stage is to find the best dictionary based on the current H. This procedure is called the dictionary update and optimizes the dictionary based on the current sparse coding:

\min_{W} \; \phi(W) = \|Y - WH\|_F^2.   (2.5)
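For intuition, when H is fixed, equation 2.5 is an ordinary least squares problem whose closed-form minimizer is W = YH^T(HH^T)^{-1}; this is exactly the MOD update reviewed in section 2.3, usually followed by normalizing the columns of W. Below is a minimal numpy sketch under those assumptions; the small ridge term `eps`, added to guard against a singular HH^T, is our own addition and not part of MOD.

```python
import numpy as np

def least_squares_dictionary_update(Y, H, eps=1e-8):
    """Closed-form minimizer of ||Y - WH||_F^2 over W (MOD-style update).

    Y : (m, L) signals, H : (r, L) current sparse coefficients.
    Returns W of shape (m, r) with unit-norm columns.
    """
    r = H.shape[0]
    # W = Y H^T (H H^T)^{-1}; solve the normal equations instead of
    # forming an explicit inverse, with a tiny ridge for stability.
    G = H @ H.T + eps * np.eye(r)
    W = np.linalg.solve(G, (Y @ H.T).T).T      # G is symmetric
    # Rescale each atom to unit l2-norm to remove the scale ambiguity.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W / np.maximum(norms, 1e-12)
```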


To avoid a trivial result or scale ambiguity, the constraints frequently imposed on W include requiring the whole W to have a unit Frobenius norm (Yaghoobi, Blumensath, & Davies, 2009) or each atom to have a unit ℓ2-norm (Engan, Aase, et al., 1999). One of the differences among the various dictionary learning algorithms lies in the dictionary update stage, in which they update either the whole set of atoms at once (Engan, Aase, et al., 1999; Rakotomamonjy, 2013) or each atom one by one (Aharon et al., 2006; Bao, Ji, Quan, & Shen, 2014). The procedure for dictionary learning is summarized in algorithm 1.

2.3 State-of-the-Art Algorithms. The method of optimal directions (MOD) is one of the simplest dictionary learning algorithms. In the sparse coding stage, H is solved using OMP (Engan, Aase, et al., 1999) or FOCUSS (Engan, Rao, & Kreutz-Delgado, 1999). MOD then minimizes ‖Y − WH‖_F² over W with H fixed, which leads to the closed-form expression W = YH^T(HH^T)^{-1}, followed by normalizing the columns of W.

K-SVD (Aharon et al., 2006) is a classic and successful dictionary learning approach. In the sparse coding stage, the sparse coding is also solved using OMP. The difference between K-SVD and MOD is in the dictionary update stage. MOD updates the whole set of atoms at once with H fixed, but it is not guaranteed to converge. In contrast, K-SVD, through singular value decomposition (SVD), updates the atoms one by one and simultaneously updates the nonzero entries in the associated row vector of H. The work of Dai et al. (2012) extends the K-SVD algorithm by allowing, in the dictionary update stage, the simultaneous optimization of several dictionary atoms and the related approximation coefficients. With this possibility of optimizing several atoms at a time, the running efficiency of their algorithm is better than the original K-SVD, although still worse than that of MOD.

More recently, Bao, Ji, Quan, and Shen (2014) proposed an alternating proximal method iteration scheme for dictionary learning with an ℓ0-norm


constraint. The alternating proximal method, actually called the proximal gradient method, is essentially a gradient-based method. Although it can be an improvement over a pure gradient method because of PO, such gradient-based methods usually have linear convergence, so there is still room for improvement. Rakotomamonjy (2013) proposed an algorithm named Dir that simultaneously optimizes both the dictionary and the sparse decompositions jointly in one stage based on a nonconvex proximal splitting framework instead of alternating optimizations. This approach is another extension of the gradient-based method.

Other relevant research is the work on online dictionary learning (Mairal, Bach, Ponce, & Sapiro, 2010; Skretting & Engan, 2010). Online algorithms continuously update the dictionary as each training vector is being processed. Because of this, they can handle very large sets of signals, such as commonly occur, for example, in image processing tasks (Mairal et al., 2010). Skretting and Engan (2010) proposed an extension of the method of optimal directions based on recursive least squares for dealing with this online framework.

3 Fast Dictionary Learning Based on the Proximal Operator

In this section, we formulate the dictionary learning problem with constraints over the coefficient matrix and the dictionary and introduce the proximal operator. Then we propose the fastPDL algorithm based on PO and analyze this algorithm, including determining the parameters and the computational complexity.

3.1 Problem Formulation. Inheriting from the definitions and notations given in section 2, we formulate our dictionary learning problem. For simplicity, we consider the convex ℓ1-norm over the coefficient matrix H ∈ R^{r×L} as the sparsity constraint,

\|H\|_1 = \sum_{i=1}^{r} \sum_{j=1}^{L} |h_{ij}|.   (3.1)

More particularly, we impose an additional constraint on the atoms of W ∈ R^{m×r} to enhance the incoherence between any two atoms,

\sum_{l=1}^{r} \sum_{k=1, k \neq l}^{r} |w_l^T w_k|,   (3.2)

where w_l and w_k are different column vectors in W. If each column of W is normalized to the unit ℓ2-norm, the constraint term in equation 3.2 is equivalent to \sum_{l=1}^{r} \sum_{k=1, k \neq l}^{r} \cos \varphi_{l,k}, which sums the cosines of the angles between any two different atoms. Here \varphi_{l,k} = \arccos |w_l^T w_k| is the


angle between atoms l and k, where 0 < \varphi_{l,k} < \pi/2. Hence, reducing equation 3.2 means enlarging the angles between column vectors in W. This also means enhancing the incoherence of the dictionary. Thus, a dictionary learned with such a constraint will be more efficient. Hence, our dictionary learning can be formulated as the following minimization problem,

\min_{W,H} \; \phi(W, H) = \frac{1}{2}\|Y - WH\|_F^2 + \alpha \|H\|_1 + \beta \sum_{l=1}^{r} \sum_{k=1, k \neq l}^{r} |w_l^T w_k|,   (3.3)

\text{s.t.} \quad \|w_k\|_2 = 1, \quad \forall k = 1, \ldots, r,

where α > 0 is a regularization parameter to suppress ‖H‖_1, which is used to control the sparsity of H, and β > 0 is also a regularization parameter, used to adjust the incoherence of W.
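To make the roles of the two regularizers concrete, the sketch below evaluates the objective of equation 3.3 for given W and H; the function name and the unit-norm assumption on the atoms are illustrative choices of ours, not part of the fastPDL algorithm itself.

```python
import numpy as np

def fastpdl_objective(Y, W, H, alpha, beta):
    """Objective of equation 3.3: 0.5*||Y - WH||_F^2 + alpha*||H||_1
    + beta * sum_{l != k} |w_l^T w_k|  (atoms assumed unit-norm)."""
    approx_err = 0.5 * np.linalg.norm(Y - W @ H, 'fro') ** 2
    sparsity = alpha * np.abs(H).sum()
    gram = np.abs(W.T @ W)                  # entries are |w_l^T w_k|
    coherence = beta * (gram.sum() - np.trace(gram))   # sum over l != k
    return approx_err + sparsity + coherence
```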

The problem, equation 3.3, is nonconvex with respect to W and H and nonsmooth because of the sparsity regularization and the incoherence penalty.

3.2 The Proximal Operator. In our problem, because the regularization term on H and the penalty term on W are both nonsmooth, we adopt PO to handle the nonsmooth terms. For a nonsmooth g(x), the proximal operator is defined as

\operatorname{Prox}_{g,\lambda}(y) := \arg\min_{x} \; \phi(x) = \frac{1}{2}\|x - y\|_2^2 + \lambda g(x) = \arg \sum_{j} \min_{x_j} \; \phi(x_j) = \frac{1}{2}\|x_j - y_j\|_2^2 + \lambda g(x_j),   (3.4)

where the parameter λ > 0. Consider g(x) = ‖x‖_1. This problem is the minimization of a piecewise quadratic function. It is easy to find closed-form solutions based on PO, which has the element-wise property and maps the vector y to the one-dimensional y_j (the jth entry of y). The closed-form solution of equation 3.4 can be expressed as

x_j^* = [\operatorname{Prox}_{g,\lambda}(y)]_j :=
\begin{cases}
  y_j - \lambda \operatorname{sign}(y_j), & |y_j| > \lambda \\
  0, & \text{otherwise}
\end{cases}.   (3.5)
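Equation 3.5 is the familiar soft-thresholding operator; a short numpy transcription follows. The same routine, applied to y = b/a with threshold λ/a, also evaluates the scaled form appearing in equation 3.8 below (this reuse is our own observation, stated here only for illustration).

```python
import numpy as np

def prox_l1(y, lam):
    """Proximal operator of lam*||.||_1 (equation 3.5): element-wise
    soft thresholding, x_j = sign(y_j) * max(|y_j| - lam, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Example: the quadratic subproblem of equation 3.6 with parameters a, b
# reduces to soft thresholding of b/a at level lam/a (cf. equation 3.8).
a, lam = 2.0, 0.3
b = np.array([1.0, -0.2, 0.5])
x_star = prox_l1(b / a, lam / a)
```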

The cost function in our dictionary learning is a piecewise quadratic function. Extending PO to fit our problem is the key to obtaining closed-form


solutions directly. We derive PO for the following piecewise quadratic functions (the parameter a > 0; b and the c_k are column vectors):

\min_{x} \; \phi(x) = \frac{1}{2} a\, x^T x - b^T x + \lambda g(x),   (3.6)

where g(x) = ‖x‖_1, and

\min_{x} \; \phi(x) = \frac{1}{2} a\, x^T x - x^T b + \lambda g(x),   (3.7)

where g(x) = \sum_{k=1}^{N} |c_k^T x|, with c_k^T c_k = 1 and c_k^T c_r \in [-1, 1], k \neq r. Remarkably, the function 3.6 has the following closed-form solution:

x_j^* = \left[\operatorname{Prox}_{g,\lambda/a}\!\left(\frac{b}{a}\right)\right]_j :=
\begin{cases}
  \dfrac{b_j}{a} - \dfrac{\lambda}{a}\operatorname{sign}(b_j), & |b_j| > \lambda \\
  0, & \text{otherwise}
\end{cases}.   (3.8)

The function g(x) in equation 3.7 is nonsmooth and different from the form in equation 3.4. However, the principle of minimization for equation 3.7 is similar to that of equation 3.4 because it is also a piecewise quadratic function. We can find the minimal solution for each piece and combine all the solutions to obtain the global minimizer. The detailed derivation is given in the appendix. Accordingly, the closed-form solution is given as

x_j^* = \left[\operatorname{Prox}_{g,\lambda/a}\!\left(\frac{b}{a}\right)\right]_j = \frac{b_j}{a} - \frac{\lambda}{a}\sum_{l \in A \cup B} c_{lj}\,\operatorname{sign}(c_l^T b) - \frac{\lambda}{a}\, c_{gj}\,\gamma, \quad l \in A \cup B,\; g \in C,   (3.9)

where the index sets A, B, C and the multiplier γ ∈ [−1, 1] are determined by conditions on c_l^T b and c_g^T b derived in the appendix.

We now compare the computational costs of the proposed algorithm with those of existing algorithms for dictionary learning. FastPDL decomposes the matrices into sets of row vectors or column vectors because it is easy to find the optimal solutions in closed form. K-SVD also performs the same decomposition scheme; however, its dictionary update is performed through an SVD decomposition of Y^{(k)}. The complexity of the SVD calculation of Y^{(k)} is O(mL² + m²L + L³), which is costly. For fastPDL's dictionary update step,


we used the closed-form solution, equation 3.16. The complexity mainly depends on the multiplications of the matrices (YH^T and HH^T) and is O(Lr² + rmL), which is much lower than that of SVD. The complexity of sparse coding is O(rmL + mr²), which is also much lower than that of OMP (O(Lrs² + rmL)) used in the K-SVD and MOD methods. In addition, the supports (nonzero positions) and the entries at the supports in H are both updated by fastPDL, unlike in the MOD and K-SVD methods, in which the supports are not changed. This contributes to a further improvement in the performance of dictionary learning. Gradient-based methods, such as Dir (Rakotomamonjy, 2013) and Bao's work (Bao, Ji, Quan, & Shen, 2014), require gradient computations that are O(rmL), a bit less than that of fastPDL. On the other hand, because such methods are gradient based, they require more iterations than fastPDL, and their convergence rates rely on the step sizes. Hence, the fastPDL algorithm has advantages in complexity and convergence rate and has fewer parameters that require tuning.

4 Simulations

In this section, to evaluate the learning capability and efficiency of fastPDL, we present the results of some numerical experiments. The first experiment is on synthetic data generated from a random dictionary taken as ground truth; we then use the proposed algorithm on the data to learn a new dictionary to compare with the ground truth dictionary. In this way, a dictionary learning algorithm should be able to extract the common features of the set of signals, which are actually the generated atoms. Furthermore, we perform another experiment in which white gaussian noise of various signal-to-noise ratios (SNRs) is added to the synthetic signals to evaluate the performance and robustness against noise.¹ In addition, we apply fastPDL to real-valued signals with noise to verify the applicability of the proposed algorithm. In the experiments, all programs were coded in Matlab and were run within Matlab (R2014a) on a PC with a 3.0 GHz Intel Core i5 CPU and 12 GB of memory, under the Microsoft Windows 7 operating system.

4.1 Dictionary Recovery from Synthetic Data. First, we generated a dictionary W by normalizing a random matrix of size m × r, with independent and identically distributed (i.i.d.) uniformly random entries. Each column was normalized to the unit ℓ2-norm. The dictionary W was referred to as the ground truth dictionary; it was not used in the learning but only for evaluation. Then a collection of L samples {y_i}_{i=1}^{L} of dimensionality m was synthesized, each as a linear combination of s different columns of the

¹ SNR_dB is defined as 10 log₁₀(‖x‖ / ‖x − y‖), where x and y denote the original signal and the signal polluted by noise, respectively.


dictionary, with i.i.d. corresponding coefficients in uniformly random and independent positions. Here s also denotes the number of nonzero elements in each column of the coefficient matrix H. Then we applied all algorithms—K-SVD, MOD, Dir of Rakotomamonjy, and the proposed fastPDL—to the generated signals to learn dictionaries and recorded their performance. Matlab code for the K-SVD and MOD algorithms is available online (http://www.cs.technion.ac.il/∼elad/software/), as is Matlab code for the Dir algorithm (http://asi.insa-rouen.fr/enseignants/∼arakoto/publication.html/). All algorithms were initialized with the same dictionary (made by randomly choosing r columns of the generated signals, followed by a normalization) and with the same coefficient matrix (randomly generated).

To evaluate the performance of the learned dictionary Ŵ, we compared the Ŵ obtained by each of the above algorithms with the ground truth W. Because the order of the columns in W and Ŵ may be different, the comparison was made by sweeping through the columns of Ŵ and W and measuring the distances between their columns via

ˆ If the distance was less than 0.01, the atom w of the learned dictionary W. i was regarded as successfully recovered. The recovery ratio for the learned dictionary was calculated by dividing the number of recovered atoms by ˆ the total number of atoms in W. In the experiment, the number of signals L was set to 1500, and the dimensionality m was 20. For the dictionary, the number of atoms r was 50. We varied s from 3 to 12. All trials were repeated 15 times for each algorithm and for various s. For all algorithms, the stopping criterion was based on the relative change of the objective function value in two consecutive iterations, (ob jn − ob jn+1 )/ob jn ≤ ξ , for a small, positive constant ξ . For Dir and fastPDL, ξ was set to 10−7 for the stopping criterion, and the maximum number of iterations was set to 10,000 for both. K-SVD and MOD imposed a very strong sparsity constraint on the data by limiting the number of nonzero elements in H from OMP to a small number. These methods thus yielded signal representations that are not particularly accurate. Hence, ξ was set to ξ = 10−5 for K-SVD and MOD. In addition, the implementation of OMP was very slow. For K-SVD and MOD, the maximal number of iterations was set to a reasonable value (e.g., 3000). These iterative methods ensured that the stopping criterion would be satisfied within the maximal number of iterations and that the algorithms could find the minimum of the objective function. The average recovery ratios and corresponding standard deviations for the learned dictionaries are shown with respect to the number


Figure 2: Experimental results for the synthetic signal. (a) Average recovery ratios and corresponding standard deviations for the learned dictionaries versus s. (b) Average running time and corresponding standard deviation of each algorithm versus s.

The average recovery ratios and corresponding standard deviations for the learned dictionaries are shown with respect to the number of nonzero elements in each column of H in Figure 2a. The average running time and standard deviation of each algorithm are shown versus the number of nonzero elements in each column of H in Figure 2b. It can be seen that in all cases, fastPDL performed the best for the recovery ratio and the running time. The recovery ratio of fastPDL was much higher than those of K-SVD and MOD in all cases. In particular, fastPDL was still effective even when the number of nonzero elements s was very high. Furthermore, fastPDL, in computational running time, was remarkably fast and stable compared with K-SVD and MOD. It is clear that K-SVD and MOD, which use two stages, were more costly. Compared with Dir, fastPDL was slightly favorable in the ratio of recovered atoms, and it had a significant advantage in running time. The reason that fastPDL outperformed Dir is that the latter employs a gradient descent method to optimize gradually, which is time-consuming even though Dir's optimization occurs in one stage only.

Obviously, the four algorithms described above have different convergence properties. To see the convergence behaviors of all algorithms, we executed these algorithms as many times as possible and determined reasonable numbers of iterations that could ensure the convergence of each algorithm. For s = 3, 3000 iterations were required for Dir and fastPDL and 1000 iterations for K-SVD and MOD, because K-SVD and MOD cost much more time than Dir and fastPDL per iteration. Fifteen trials were conducted, and performance results were averaged. With respect to the number of iterations and the running time per iteration, the average recovery ratios for the learned dictionaries are shown in Figure 3. It can be seen that the number of iterations required for convergence for K-SVD and MOD was fewer


Figure 3: Experimental results for the synthetic signal (s = 3). (a) Average recovery ratios for the learned dictionaries versus iteration number. (b) Average recovery ratios for the learned dictionaries versus running time per iteration.

than for fastPDL in Figure 3a; however, the computational complexity of each iteration for K-SVD and MOD was much greater than that for fastPDL. Hence, the convergence rate of fastPDL versus the running time per iteration was faster than that of K-SVD and MOD in Figure 3b. The computational complexity of each iteration for Dir was a little less than that for fastPDL; however, the number of iterations required for convergence for Dir was greater than that for fastPDL in Figure 3a, because the gradient descent converged slowly. Hence, the convergence rate of fastPDL versus the running time per iteration was faster than that of Dir in Figure 3b. Therefore, fastPDL had significant advantages in convergence rate.

As is well known, K-SVD and MOD require a specific and fixed number of nonzero elements in each column of H, so that they have a fixed sparsity during the iterations. For Dir and fastPDL, the sparsity of the coefficient matrix was adjusted via the regularization parameters. Figure 4a depicts the average recovery ratios and corresponding standard deviations for the dictionary learned by fastPDL with respect to the regularization parameters. The cost function values of fastPDL versus iteration number are shown in Figure 4b for different α. We can see that the recovery ratio was better when α = 0.1. Moreover, the cost function values were smaller and converged faster when α = 0.1. This parameter was fixed during the iterations to reduce the number of iterations and the computational requirements.

The sparsity of the coefficient matrix H, which was measured by dividing the number of nonzero elements in H by the total number of elements (zero and nonzero) in H, is shown in Figures 5a and 5b against the number of iterations and the running time per iteration, respectively. Generally, lower values of the sparsity of H make the dictionary W more efficient. Obviously the sparsity of fastPDL converged faster and could reach a lower value than that of Dir, while the sparsities for K-SVD and MOD were not plotted


Figure 4: Experimental results for the synthetic signal (s = 3). (a) Average recovery ratios and corresponding standard deviations for the fastPDL dictionary versus parameter α. (b) Cost function values of fastPDL (approximation error + α × sparsity measure + β × incoherence measure) versus iteration number for different values of parameter α.

Figure 5: Experimental results for the synthetic signal (s = 3). (a) Sparsity of the coefficient matrix versus iteration number. (b) Sparsity of the coefficient matrix versus running time per iteration. The sparsities of K-SVD and MOD are not plotted because they are fixed (3/50 = 0.06).

because they had the fixed sparsity (3/50) decided by the OMP method. In addition, we showed histograms of the effective sparsity (the number of nonzero elements in each column of H) of the L = 1500 samples' representations for fastPDL. In the ground truth H that was randomly generated, the number of nonzero elements in each column was s = 3; hence the number of coefficient vectors with three nonzero elements was 1500 in Figure 6a. Note that the number of coefficient vectors is the same as the number of samples {y_i}. The number of coefficient vectors for fastPDL with three nonzero elements was 685 in Figure 6b. It could be seen that the vectors with three nonzero elements


Figure 6: Experimental results for the synthetic signal (s = 3). (a) A histogram of the effective sparsity of the 1500 signals’ representation for randomly generated H. (b) A histogram of the effective sparsity of the 1500 signals’ representation for fastPDL.

were dominant in the output of fastPDL. Thus, the histogram confirms that the columns of H obtained by fastPDL were sparse.

In addition, we performed an experiment to show the dependence of the rate of recovery of atoms on the redundancy ratio of the dictionary. Here, we define the redundancy ratio of the dictionary as m/r, where m is the dimensionality of the dictionary atoms and r is the number of dictionary atoms. In the experiment, r was fixed to 50. If the dictionary can be recovered even when the redundancy ratio is small, it means that the sparse coefficient matrix can be recovered from fewer linear combinations of atoms, and the sparse representation is then more efficient. To evaluate the ability to learn the dictionary as the redundancy ratio is reduced as far as possible, we ran the dictionary recovery experiment under different redundancy ratios. Fifteen trials were conducted, and performance results were averaged. The average recovery rates and corresponding standard deviations are shown in Figure 7a. We found that the fastPDL algorithm displayed a superior recovery rate compared with the other algorithms. In particular, fastPDL was still efficient at m/r = 0.2, while the other algorithms deteriorated greatly. Furthermore, we show the sparsity of the coefficient matrix H in Figure 7b. It can be seen that the sparsity of fastPDL reaches lower values than that of Dir under various redundancy ratios, while the sparsities for K-SVD and MOD are not plotted because they are fixed.

4.2 Evaluating the Performance and Robustness against Noise for Different Noise Levels. Besides the noiseless condition, we also performed another experiment in which white gaussian noise at various SNRs corrupted the signals to evaluate the algorithm's performance and robustness against


Figure 7: Experimental results for the synthetic signal (s = 3, r = 50). (a) Average recovery rates for the learned dictionaries versus m/r. (b) Sparsity of the coefficient matrix versus m/r.

noise. We added noise with SNR levels of 10 dB, 20 dB, and 30 dB according to the model Y = WH + E, where E was a noise matrix. The settings for the experiment were L = 1500, m = 20, r = 50, and s = 3. For each of the tested algorithms and for various SNR levels, 15 trials were conducted and their results recorded. The average recovery ratios and corresponding standard deviations for the learned dictionaries for these algorithms at levels of 10 dB, 20 dB, 30 dB, and the noiseless case are shown in Figure 8a. The means and standard deviations of the running time (until the relative change of the objective function value in two consecutive iterations was less than a small positive constant) are also shown in Figure 8b. It can be seen that fastPDL performed the best on dictionary learning at each noise level. Although the differences in recovery ratios for the learned dictionaries between fastPDL and Dir under various conditions were small, fastPDL had a remarkable advantage in running time.

4.3 Applications and Performance Evaluations. Dictionary learning for sparse representation of signals has many applications, such as denoising, inpainting, and classification. In this section, we apply fastPDL in one particular application, denoising. Signal denoising refers to the removal of noise from a signal. An image contains only nonnegative signals, while both nonnegative and negative signals occur in audio data. This motivates us to apply the fastPDL algorithm to both image processing and audio processing.

4.3.1 Image Denoising. Although the proposed fastPDL algorithm does not require the signal to be nonnegative, it can indeed work effectively in such cases, as we now show. In the additive noise model, white gaussian noise V with various standard deviations σ corrupted the clean image Y₀ by Y = Y₀ + V. We used noisy image patches of size 8 × 8 pixels as test data to learn a dictionary of


Figure 8: Experimental results versus noise. (a) Average recovery ratios and corresponding standard deviations for the learned dictionaries at different noise levels. (b) Average running time and corresponding standard deviations of each algorithm at different noise levels.

size 64 × 256. This dictionary was then used to remove the noise from the noisy observed image Y. The denoising method is based on the K-SVD denoising scheme implemented in Elad and Aharon (2006). To provide a fair comparison, all the settings were the same as those used with the K-SVD method. For the dictionary learning stage, we chose an overcomplete DCT dictionary as the initialization to learn the dictionary by the fastPDL algorithm, and the number of iterations was set to 10. The sparse coding stage was based on the OMP algorithm. After the whole process of dictionary learning, we used the learned dictionary to reconstruct the images.

Some standard test images, including Boats (512 × 512), Barbara (512 × 512), Peppers (256 × 256), and House (256 × 256), were used in the experiment, and one of the reconstruction results is presented in Figure 9. We deliberately show denoised images and the final adaptive dictionaries for a case with very strong noise, because fastPDL was effective even there. Here, we used two quality measures, the peak SNR (PSNR) and the structural similarity (SSIM) (Wang, Bovik, Sheikh, Sapiro, & Simoncelli, 2004), to assess the denoised images. SSIM takes values between 0 and 1, and it equals 1 if Y = Y₀. The average PSNR and SSIM of the denoised images over 10 trials at different noise levels are summarized in Table 1. It shows that in some situations fastPDL can obtain better results, and the averaged results of fastPDL were very close to those of K-SVD. The averaged running times are summarized in Table 2, which shows that the running time of fastPDL was shorter than that of K-SVD. An interesting phenomenon is that the running time for all the images is reduced with increasing noise levels. This may be because, with a higher noise level, the image must be represented with fewer atoms, because more atoms are noisy.
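As a rough companion to this setup, the sketch below extracts 8 × 8 patches as 64-dimensional training vectors and computes the PSNR reported in Table 1. The non-overlapping patch grid and the 8-bit peak value of 255 are simplifying assumptions of ours; K-SVD-style denoising normally works with overlapping patches.

```python
import numpy as np

def extract_patches(img, p=8):
    """Stack non-overlapping p x p patches of a 2-D image as columns,
    giving a signal matrix Y of shape (p*p, number_of_patches)."""
    h, w = img.shape
    patches = [img[i:i + p, j:j + p].reshape(-1)
               for i in range(0, h - p + 1, p)
               for j in range(0, w - p + 1, p)]
    return np.stack(patches, axis=1)

def psnr(clean, denoised, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((clean.astype(float) - denoised.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```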


Figure 9: Example of denoising results for the image Peppers with the noise level of 20 dB. The corresponding dictionaries trained from Peppers are also shown. The first items show PSNR values, and the second items the SSIM index.

Table 1: PSNR (dB) and SSIM of Denoising Results.

                        Boat                Barbara             Peppers             House
σ / PSNR    Method      PSNR      SSIM      PSNR      SSIM      PSNR      SSIM      PSNR      SSIM
5 / 34.16   K-SVD       37.0255   0.9396    37.6528   0.9594    37.901    0.9548    39.4905   0.9555
            fastPDL     37.0284   0.9396    37.6529   0.9595    37.897    0.9548    39.4765   0.9554
10 / 28.15  K-SVD       33.3949   0.8814    33.9599   0.9257    34.2115   0.9226    35.965    0.9062
            fastPDL     33.3943   0.8814    33.9241   0.9252    34.2357   0.9227    35.9705   0.9063
15 / 24.61  K-SVD       31.4239   0.8357    31.96     0.8971    32.1934   0.8981    34.3171   0.8771
            fastPDL     31.442    0.8358    31.9022   0.896     32.1466   0.8970    34.2136   0.8764
25 / 20.12  K-SVD       29.0198   0.7621    29.2317   0.8333    29.7746   0.8574    32.2081   0.8463
            fastPDL     29.0523   0.7626    29.1088   0.8295    29.6686   0.8551    32.0013   0.8453
50 / 14.09  K-SVD       25.6511   0.6367    25.2143   0.6899    26.1915   0.7764    27.9702   0.7604
            fastPDL     25.5263   0.6310    25.0605   0.6846    25.71391  0.7661    27.4141   0.7512

Notes: In each cell, top row: K-SVD; second row: fastPDL. Figures in bold show the best result in each cell.

4.3.2 Noise Cancellation of Audio Signal. In this section, we present the application of the proposed fastPDL algorithm to a more general signal, that is, a signal without a nonnegativity constraint. An audio signal is an appropriate example. The audio data used in the experiments were recorded from BBC radio music, sampled at 16 kHz and lasting 4.1 s. Different levels of white gaussian noise corrupted the audio data. We considered the corrupted audio as training samples, each being composed of 64 time samples. The learned


Table 2: Comparison of Running Time in Seconds for Image Signals.

σ / PSNR     Method     Boat     Barbara   Peppers   House
5 / 34.16    K-SVD      8.8048   8.8978    8.3413    5.1428
             fastPDL    3.9209   4.2362    3.6768    2.5609
10 / 28.15   K-SVD      4.7485   4.9622    4.5821    3.3541
             fastPDL    2.5749   2.534     2.2583    1.7438
15 / 24.61   K-SVD      3.3470   3.5784    3.4730    2.7144
             fastPDL    1.9409   2.023     1.8688    1.4860
25 / 20.12   K-SVD      2.8252   2.916     2.8641    2.4572
             fastPDL    1.6982   1.6823    1.4967    1.3891
50 / 14.09   K-SVD      2.4674   2.5537    2.3606    2.3212
             fastPDL    1.4493   1.4858    1.3119    1.2871

Notes: In each cell, top row: K-SVD; second row: fastPDL. Figures in bold show the best result for each cell.

Figure 10: Example of ISNR for audio data with noise level of 0 dB.

dictionary was of size 64 × 256. The noise cancellation procedure was similar to that used for image denoising, and 10 trials were conducted. The reconstructed result is presented in Figure 10. For the audio signal, we used the improvement in signal-to-noise ratio (ISNR) to evaluate the performance of noise cancellation,

\mathrm{ISNR} = 10 \log_{10} \frac{E\{(x(t) - x_n(t))^2\}}{E\{(x(t) - \hat{x}(t))^2\}},

where x(t) is the original signal, x_n(t) is the observed noisy signal, and \hat{x}(t) is the reconstructed signal.
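A direct transcription of this measure follows; estimating the expectations by sample means over the recorded signal is our assumption.

```python
import numpy as np

def isnr(x, x_noisy, x_hat):
    """Improvement in SNR (dB): it grows as the reconstruction x_hat
    approaches the original signal x."""
    num = np.mean((x - x_noisy) ** 2)    # E{(x - x_n)^2}
    den = np.mean((x - x_hat) ** 2)      # E{(x - x_hat)^2}
    return 10.0 * np.log10(num / den)
```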


Table 3: Comparisons of ISNR for Audio Signal.

           10 dB     0 dB      −10 dB
K-SVD      8.4263    9.5779    11.8957
FastPDL    8.4212    9.3561    11.7532

Note: Figures in bold show the best result in each cell.

Table 4: Comparisons of Running Time in Seconds for Audio Signal.

           10 dB     0 dB      −10 dB
K-SVD      3.974     2.7029    2.2987
FastPDL    2.0012    1.5346    1.2446

Note: Figures in bold show the best result in each cell.

As the reconstructed signal becomes closer to the original signal, ISNR increases. Tables 3 and 4 show that the results of fastPDL are very close to those of K-SVD in ISNR performance, and that fastPDL significantly outperforms K-SVD in running time.

5 Conclusion

The main motivation of this letter was to develop a fast and efficient algorithm for learning an incoherent dictionary for sparse representation. The problem was cast as the minimization of the approximation error function with the coherence penalty of the dictionary atoms and the sparsity regularization of the coefficient matrix. To solve the problem efficiently, we turned it into a set of minimizations of piecewise quadratic univariate subproblems, each of which was a single-vector-variable problem of one dictionary atom or one coefficient vector. Although the subproblems were still nonsmooth, remarkably, we could find the optimal solution in closed form using proximal operators. This led to the so-called fastPDL algorithm. Interestingly, this algorithm updates the dictionary and the coefficient matrix in the same manner, forcing the coefficients to be as sparse as possible while simultaneously forcing the atoms of the dictionary to be as incoherent as possible. We have verified that fastPDL is efficient in both computational complexity and convergence rate. These advantages are achieved because the optimal dictionary atoms and coefficient vectors are obtained in closed form without iterative optimization. The proposed fastPDL algorithm is quite simple but very efficient. The numerical experiments have shown that fastPDL is more efficient than the state-of-the-art algorithms. In addition, we have described two applications of the proposed algorithm, one


for a nonnegative signal and the other for a general signal, and have shown that the fastPDL algorithm has a significant advantage in computation time for both of them. Further applications remain as our future work.

Appendix: Derivation of the Proximal Operator for the Coherence Penalty

Problem 3.7 includes a nonsmooth function g(x), a complication since it is the summation of |c_k^T x|, (k = 1, . . . , N). For simplicity, we at first consider the case N = 1, that is,

\min_{x} \; \phi(x) = \frac{a}{2} x^T x - x^T b + \lambda g(x),   (A.1)

where g(x) = |c^T x|, c^T c = 1, and a > 0. Then we derive PO for g(x) of equation A.1,

\min_{x} \phi(x) \iff \frac{b}{a} - x \in \frac{\lambda}{a}\,\partial g \iff x = \operatorname{Prox}_{g,\lambda/a}\!\left(\frac{b}{a}\right),   (A.2)

where Prox denotes PO and ∂g denotes the subgradient of g(x). To compute ∂g, we consider three cases: c^T x > 0, c^T x < 0, and c^T x = 0. For c^T x = 0, ∂|c^T x| is cγ, where γ ∈ [−1, 1]. Then we get the solution of equation A.1 in the case of c^T x = 0,

x = \frac{b}{a} - \frac{\lambda c \gamma}{a}.   (A.3)

Since c^T x = 0 and c^T c = 1, we get |c^T b| ≤ λ and γ = c^T b / λ. Hence, the solution A.3 can be rewritten as

x = \frac{b}{a} - \frac{c\,(c^T b)}{a}.   (A.4)

The same derivation applies for c^T x > 0 and c^T x < 0: ∂|c^T x| is c for c^T x > 0 and −c for c^T x < 0, respectively. Then we obtain

x = \frac{b}{a} - \frac{\lambda c}{a}, \quad \text{for } c^T b > \lambda,   (A.5)

x = \frac{b}{a} + \frac{\lambda c}{a}, \quad \text{for } c^T b < -\lambda.   (A.6)

Summarizing equations A.4 to A.6, we get the closed-form solution of equation A.1,


x_j^* = \left[\operatorname{Prox}_{g,\lambda/a}\!\left(\frac{b}{a}\right)\right]_j :=
\begin{cases}
  \dfrac{b_j}{a} - \dfrac{\lambda c_j}{a}\operatorname{sign}(c^T b), & |c^T b| > \lambda \\
  \dfrac{b_j}{a} - \dfrac{c_j (c^T b)}{a}, & \text{otherwise}
\end{cases},   (A.7)

where x_j and b_j are the jth entries of x and b, respectively. Now we derive PO for the nonsmooth g(x) of equation 3.7, where N > 1. In addition, we have c_k^T c_k = 1, ∀k = 1, . . . , N and c_k^T c_r ∈ [−1, 1], r ≠ k. Thus we obtain

\min_{x} \phi(x) \iff \frac{b}{a} - x \in \frac{\lambda}{a} \sum_{k=1}^{N} \partial|c_k^T x| \iff x = \operatorname{Prox}_{g,\lambda/a}\!\left(\frac{b}{a}\right).   (A.8)

To compute \sum_{k=1}^{N} \partial|c_k^T x|, we define d_1 = |A|, A = \{k \mid c_k^T x > 0,\ k \in 1 \ldots N\}; d_2 = |B|, B = \{k \mid c_k^T x < 0,\ k \in 1 \ldots N\}; and d_3 = |C|, C = \{k \mid c_k^T x = 0,\ k \in 1 \ldots N\}, where |·| is the cardinality of a set and d_1 + d_2 + d_3 = N. Then we get the solution of equation 3.7,

x = \frac{b}{a} - \frac{\lambda}{a}\sum_{k \in A} c_k + \frac{\lambda}{a}\sum_{k \in B} c_k - \frac{\lambda}{a}\sum_{k \in C} c_k \gamma, \quad \gamma \in [-1, 1].   (A.9)

For the case of e ∈ A, we have c_e^T x > 0, which reduces to

\frac{c_e^T b}{a} - \frac{\lambda}{a} c_e^T \sum_{k \in A} c_k + \frac{\lambda}{a} c_e^T \sum_{k \in B} c_k - \frac{\lambda}{a} c_e^T \sum_{k \in C} c_k \gamma > 0.   (A.10)

Since c_k^T c_k = 1 and c_k^T c_r ∈ [−1, 1], r ≠ k, we have

c_e^T \sum_{k \in A} c_k \in [-d_1 + 2,\ d_1]; \quad c_e^T \sum_{k \in B} c_k \in [-d_2,\ d_2]; \quad c_e^T \sum_{k \in C} c_k \in [-d_3,\ d_3].   (A.11)

Plugging equation A.11 into A.10, we obtain

c_e^T b > \lambda c_e^T \sum_{k \in A} c_k - \lambda c_e^T \sum_{k \in B} c_k + \lambda c_e^T \sum_{k \in C} c_k \gamma \in \lambda\,[\,2 - d_1 - d_2 - d_3,\; d_1 + d_2 + d_3\,].   (A.12)

To make equation A.12 more readable, we consider the different situations for d_1 + d_2 + d_3 separately (equations A.13 and A.14). According to equations A.13 and A.14, for c_e^T x > 0, ∀e ∈ A, λ satisfies

\frac{c_e^T b}{d_1 + d_2 + d_3} > \lambda \quad \text{and} \quad c_e^T b > 0.   (A.15)

Similarly, in the case of c_f^T x < 0, ∀f ∈ B, we have

\frac{c_f^T b}{d_1 + d_2 + d_3} < -\lambda \quad \text{and} \quad c_f^T b < 0,   (A.16)

and in the case of c_g^T x = 0, ∀g ∈ C, we have

-\lambda \le \frac{c_g^T b}{d_1 + d_2 + d_3} \le \lambda.
−λ

A Fast Algorithm for Learning Overcomplete Dictionary for Sparse Representation Based on Proximal Operators.

We present a fast, efficient algorithm for learning an overcomplete dictionary for sparse representation of signals. The whole problem is considered a...
904KB Sizes 0 Downloads 15 Views