LETTER

Communicated by Evan Archer

Nonparametric Estimation of Kullback-Leibler Divergence

Zhiyi Zhang [email protected]
Michael Grabchak [email protected]
Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC 28223, U.S.A.

In this letter, we introduce an estimator of Kullback-Leibler divergence based on two independent samples. We show that on any finite alphabet, this estimator has an exponentially decaying bias and that it is consistent and asymptotically normal. To explain the importance of this estimator, we provide a thorough analysis of the more standard plug-in estimator. We show that it is consistent and asymptotically normal, but with an infinite bias. Moreover, if we modify the plug-in estimator to remove the rare events that cause the bias to become infinite, the bias still decays at a rate no faster than O(1/n). Further, we extend our results to estimating the symmetrized Kullback-Leibler divergence. We conclude by providing simulation results, which show that the asymptotic properties of these estimators hold even for relatively small sample sizes.

1 Introduction

Let P = {p_k : k = 1, . . . , K} and Q = {q_k : k = 1, . . . , K} be two discrete probability distributions on the same finite alphabet, X = {ℓ_k : k = 1, . . . , K}, where K ≥ 2 is a finite integer. The Kullback-Leibler divergence, also known as relative entropy, is defined to be

D = D(P‖Q) = Σ_{k=1}^K p_k ln(p_k/q_k) = Σ_{k=1}^K p_k ln(p_k) − Σ_{k=1}^K p_k ln(q_k).   (1.1)

Here and throughout, we adopt the standard conventions that if q ≥ 0, then 0 ln(0/q) = 0, and if p > 0, then p ln(p/0) = ∞. The Kullback-Leibler divergence, introduced by Kullback and Leibler (1951), measures the distance between two probability distributions and is an important measure of information in information theory (for details, see Cover & Thomas, 2006). This distance is not a metric since it does not satisfy the triangle inequality and is not symmetric. A symmetric modification of Kullback-Leibler divergence is considered in section 4.
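As a quick numeric illustration of definition 1.1 and the conventions above, the following sketch computes D(P‖Q) directly; the two three-letter distributions are hypothetical examples, not ones used later in the letter.

```python
# A hypothetical three-letter example; kl_divergence implements equation 1.1
# together with the conventions 0 ln(0/q) = 0 and p ln(p/0) = infinity.
import math

def kl_divergence(P, Q):
    d = 0.0
    for p, q in zip(P, Q):
        if p == 0.0:
            continue          # convention: 0 ln(0/q) = 0
        if q == 0.0:
            return math.inf   # convention: p ln(p/0) = infinity
        d += p * math.log(p / q)
    return d

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kl_divergence(P, Q), kl_divergence(Q, P))  # finite, positive, and unequal
```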


In this letter, we consider nonparametric estimation of D(P‖Q) when P and Q are unknown. Mori, Nishikimi, and Smith (2005) considered the case when the distribution Q is known. Another variant was studied in Marcon, Hérault, Baraloto, and Lang (2012). Perhaps the most commonly studied variant of this problem is when P and Q are continuous distributions (see Wang, Kulkarni, & Verdú, 2009; Nguyen, Wainwright, & Jordan, 2010). It should be noted that when estimating the Kullback-Leibler divergence in the continuous case, one often discretizes the data, and hence our framework is applicable to such situations.

Assume that P and Q are unknown. Let X_1, . . . , X_n be an independent and identically distributed (i.i.d.) sample of size n from X according to P, and let Y_1, . . . , Y_m be an i.i.d. sample of size m from X according to Q. Assume further that the two samples are independent of each other. Let {x_k = Σ_{i=1}^n 1[X_i = ℓ_k] : k = 1, . . . , K} and {y_k = Σ_{i=1}^m 1[Y_i = ℓ_k] : k = 1, . . . , K} be the sequences of observed counts, and let P̂ = {p̂_k = x_k/n : k = 1, . . . , K} and Q̂ = {q̂_k = y_k/m : k = 1, . . . , K} be the sample proportions. We make the following assumptions:

A1: p_k > 0 and q_k > 0 for k = 1, . . . , K.
A2: there exists a λ ∈ (0, ∞) such that n/m → λ.

Note that we are not assuming that K is known, only that it is some finite integer greater than or equal to two. Perhaps the most intuitive estimator of D(P‖Q) is the so-called plug-in estimator given by

D̂_n = D̂_n(P‖Q) = Σ_{k=1}^K p̂_k ln(p̂_k) − Σ_{k=1}^K p̂_k ln(q̂_k).   (1.2)

In section 2.1 we show that this estimator is consistent and asymptotically normal, but with an infinite bias. Then in section 2.2, we introduce a modification of this estimator and show that it is still consistent and asymptotically normal, but now with a finite bias. We further show that this bias decays no faster than O(1/n). In section 3, we introduce an estimator of D(P‖Q) given by

D̂'_n = D̂'_n(P‖Q) = Σ_{k=1}^K p̂_k [ Σ_{v=1}^{m−y_k} (1/v) Π_{j=1}^v (1 − y_k/(m−j+1)) − Σ_{v=1}^{n−x_k} (1/v) Π_{j=1}^v (1 − (x_k−1)/(n−j)) ],   (1.3)

and show that it is consistent and asymptotically normal, and that its bias decays exponentially fast. In section 4, we derive the corresponding results


for the symmetrized Kullback-Leibler divergence. Finally, in section 5, we perform a small-scale simulation study to see how well our asymptotic results hold for finite sample sizes.

Note that the estimators D̂_n and D̂'_n appear to depend on K. However, for any letter ℓ_k with x_k = 0, we have p̂_k = 0, and thus we do not need to include it in the sum. For this reason, the sum only needs to be taken over letters that have been observed; hence, knowledge of K is not required.

Before proceeding, let

A = Σ_{k=1}^K p_k ln(p_k)  and  B = −Σ_{k=1}^K p_k ln(q_k),   (1.4)

and note that D(P‖Q) = A + B. The first term, A, is the negative of the entropy of the distribution P. Entropy is a measure of randomness, first introduced in Shannon (1948). The problem of estimating the entropy (and thus −A) has a long history, going back to at least Miller (1955). For an overview of the literature on entropy estimation see Paninski (2003) and the references therein. More recent results can be found in Zhang (2012, 2013), Zhang and Zhang (2012), and Zhang and Grabchak (2013).

We conclude this section by introducing some notation. We use := to indicate a defining equality; ⌊·⌋ to indicate the integer part of a nonnegative real number; →^L and →^p to denote convergence in law and convergence in probability, respectively; Bin(m, p) to denote a binomial distribution with parameters m and p; N(μ, σ²) to denote a normal distribution with mean μ and variance σ²; and MVN(μ, Σ) to denote a multivariate normal distribution with mean vector μ and covariance matrix Σ. For any two functions f and g taking values in (0, ∞) with lim_{n→∞} f(n) = lim_{n→∞} g(n) = 0, we write f(n) = O(g(n)) to mean

0 < lim inf_{n→∞} g(n)/f(n) ≤ lim sup_{n→∞} g(n)/f(n) < ∞.   (1.5)

f(n) ≤ O(g(n)) is defined as in equation 1.5, but allowing for equality at ∞ (but strict inequality at 0), and O(g(n)) ≤ f(n) is defined as in equation 1.5, but allowing for equality at 0 (but strict inequality at ∞).

2 Preliminaries

In this section, we present some results about the plug-in estimator and its modifications. These results appear to be missing from the literature. While we will need them to prove the asymptotic normality of the estimator given in equation 1.3, they are of independent interest.


2.1 Properties of the Plug-in Estimator. In this section we show that D̂_n, the plug-in estimator given in equation 1.2, is a consistent and asymptotically normal estimator of D, but with an infinite bias. Define the (2K − 2)-dimensional vectors

v = (p_1, . . . , p_{K−1}, q_1, . . . , q_{K−1})^τ  and  v̂_n = (p̂_1, . . . , p̂_{K−1}, q̂_1, . . . , q̂_{K−1})^τ,

and note that v̂_n →^p v as n → ∞. Moreover, by the multivariate normal approximation to the multinomial distribution,

√n (v̂_n − v) →^L MVN(0, Σ(v)),   (2.1)

where Σ(v) is the (2K − 2) × (2K − 2) covariance matrix given by

Σ(v) = ⎛ Σ_1(v)     0     ⎞
       ⎝   0      Σ_2(v)  ⎠.   (2.2)

Here, Σ_1(v) and Σ_2(v) are (K − 1) × (K − 1) matrices given by

Σ_1(v) = ⎛ p_1(1 − p_1)      −p_1 p_2         ···   −p_1 p_{K−1}          ⎞
         ⎜ −p_2 p_1          p_2(1 − p_2)     ···   −p_2 p_{K−1}          ⎟
         ⎜    ···               ···           ···      ···                ⎟
         ⎝ −p_{K−1} p_1      −p_{K−1} p_2     ···   p_{K−1}(1 − p_{K−1})  ⎠

and

Σ_2(v) = λ ⎛ q_1(1 − q_1)      −q_1 q_2         ···   −q_1 q_{K−1}          ⎞
           ⎜ −q_2 q_1          q_2(1 − q_2)     ···   −q_2 q_{K−1}          ⎟
           ⎜    ···               ···           ···      ···                ⎟
           ⎝ −q_{K−1} q_1      −q_{K−1} q_2     ···   q_{K−1}(1 − q_{K−1})  ⎠.

Let

G(v) = Σ_{k=1}^{K−1} p_k (ln p_k − ln q_k) + (1 − Σ_{k=1}^{K−1} p_k) [ ln(1 − Σ_{k=1}^{K−1} p_k) − ln(1 − Σ_{k=1}^{K−1} q_k) ]


and

g(v) := ∇G(v) = ( ∂G(v)/∂p_1, . . . , ∂G(v)/∂p_{K−1}, ∂G(v)/∂q_1, . . . , ∂G(v)/∂q_{K−1} )^τ.

For each j, j = 1, . . . , K − 1, we have

∂G(v)/∂p_j = ln p_j − ln q_j − [ ln(1 − Σ_{k=1}^{K−1} p_k) − ln(1 − Σ_{k=1}^{K−1} q_k) ]
           = ln p_j − ln q_j − (ln p_K − ln q_K) = ln(p_j/q_j) − ln(p_K/q_K)

and

∂G(v)/∂q_j = −p_j/q_j + (1 − Σ_{k=1}^{K−1} p_k)/(1 − Σ_{k=1}^{K−1} q_k) = −p_j/q_j + p_K/q_K.

The delta method gives the following result:

Proposition 1. Provided that g^τ(v)Σ(v)g(v) > 0,

√n (D̂_n − D) [g^τ(v) Σ(v) g(v)]^{−1/2} →^L N(0, 1).   (2.3)

Remark 1. It is well known that Σ(v) is a positive-definite matrix (see Tanabe & Sagae, 1992). For this reason, the condition g^τ(v)Σ(v)g(v) > 0 is equivalent to the condition that g(v) ≠ 0. It is straightforward to see that this holds if and only if P ≠ Q.

By the continuous mapping theorem, Σ(v̂_n) is a consistent estimator of Σ(v) and g(v̂_n) is a consistent estimator of g(v). From this, Slutsky's theorem, and remark 1, we get the following:

Corollary 1. Provided that P ≠ Q,

√n (D̂_n − D) [g^τ(v̂_n) Σ(v̂_n) g(v̂_n)]^{−1/2} →^L N(0, 1).   (2.4)

Although D̂_n is a consistent and asymptotically normal estimator of D, it has an infinite bias. To see this, we note that P({p̂_k > 0} ∩ {q̂_k = 0}) > 0 for every k, k = 1, . . . , K, and hence E(−p̂_k ln(q̂_k)) = ∞.

It is clear that the infinite bias is an artifact of events with vanishingly small probability. Specifically, the problem stems from the fact that q̂_k may


be zero. To better understand the behavior of the plug-in, we will see what happens when we ignore such events. To do this, we modify the estimator of q_k to make sure that it is never zero. While there are many ways to do this, we introduce the following augmentation:

q̂*_k = q̂_k + 1[y_k = 0]/m,   k = 1, . . . , K.   (2.5)

2.2 Properties of the Augmented Plug-in Estimator. Define an augmented plug-in estimator of D by

D̂*_n = Σ_{k=1}^K p̂_k ln(p̂_k) − Σ_{k=1}^K p̂_k ln(q̂*_k) =: Â_n + B̂*_n.   (2.6)

As with the estimators D̂_n and D̂'_n, this appears to depend on K, but, in fact, knowledge of K is not required.

In this section, we show that D̂*_n is a consistent and asymptotically normal estimator of D with a bias that decays no faster than O(1/n). Our goal is not to suggest that this is a useful estimator, but to help us understand the behavior of the plug-in. Moreover, these results will be instrumental in deriving the corresponding properties of the estimator proposed in equation 1.3.

Proposition 2. Provided that P ≠ Q,

√n (D̂*_n − D) [g^τ(v) Σ(v) g(v)]^{−1/2} →^L N(0, 1).   (2.7)

Proof. In light of proposition 1 and remark 1, it suffices to show that √n(D̂*_n − D̂_n) →^p 0. The fact that for any ε > 0,

P(|√n(D̂*_n − D̂_n)| > ε) ≤ P(∪_{k=1}^K [y_k = 0]) ≤ Σ_{k=1}^K P(y_k = 0) = Σ_{k=1}^K (1 − q_k)^m → 0,

gives the result.

By arguments similar to the proof of corollary 1, we have the following:

Corollary 2. Provided that P ≠ Q,

√n (D̂*_n − D) [g^τ(v̂*_n) Σ(v̂*_n) g(v̂*_n)]^{−1/2} →^L N(0, 1),   (2.8)

where v̂*_n = (p̂_1, . . . , p̂_{K−1}, q̂*_1, . . . , q̂*_{K−1})^τ.


Example 1. In applications, a commonly used special case is when K = 2. In this case, P and Q are both Bernoulli distributions with probabilities of success being p and q, respectively. Note that, in general, q ≠ 1 − p. In this case, v = (p, q)^τ,

Σ(v) = ⎛ p(1 − p)      0       ⎞
       ⎝    0      λq(1 − q)   ⎠,

g(v) = ( ln[p(1 − q)/(q(1 − p))], (q − p)/(q(1 − q)) )^τ,

and σ² := g^τ(v) Σ(v) g(v) reduces to

σ² = p(1 − p) [ ln(p(1 − q)/(q(1 − p))) ]² + λ(p − q)²/(q(1 − q)).   (2.9)

By propositions 1 and 2, it follows that

√n (D̂_n − D) →^L N(0, σ²)  and  √n (D̂*_n − D) →^L N(0, σ²).
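To make equation 2.9 concrete, the following sketch evaluates σ² and the implied standard error of these estimators for illustrative values p = 0.3, q = 0.5, λ = 1, and n = 1000 (these numbers are assumptions, not taken from the letter).

```python
# Asymptotic variance of equation 2.9 for illustrative parameter values.
import math

def sigma2_bernoulli(p, q, lam):
    log_term = math.log(p * (1 - q) / (q * (1 - p)))
    return p * (1 - p) * log_term ** 2 + lam * (p - q) ** 2 / (q * (1 - q))

p, q, lam, n = 0.3, 0.5, 1.0, 1000
s2 = sigma2_bernoulli(p, q, lam)
print(s2, math.sqrt(s2 / n))   # sigma^2 and the approximate standard error at size n
```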

We now discuss the bias of D̂*_n. Miller (1955) (see also Paninski, 2003, for a more formal treatment) shows that

E[Â_n] − A ≥ 0  and  E[Â_n] − A = O(1/n).   (2.10)

We will show that for large enough n,

E[B̂*_n] − B ≥ 0  and  E[B̂*_n] − B ≥ O(1/n).   (2.11)

Once this is established, we immediately get the following:

Proposition 3. For large enough n,

E[D̂*_n] − D ≥ 0  and  E[D̂*_n] − D ≥ O(1/n).   (2.12)

The proof of equation 2.11, and hence of proposition 3, will be based on the following result for general plug-in estimators of B:

Lemma 1. Let q̃_k be any estimator of q_k such that q̃_k is independent of p̂_k and 0 < q̃_k ≤ 1 a.s. for all k = 1, . . . , K. If c_k = E[q̃_k] and B̃ = −Σ_{k=1}^K p̂_k ln q̃_k, then

E[B̃] − B ≥ Σ_{k=1}^K p_k [ ln(q_k/c_k) + (1/2) Var(q̃_k) ].


Proof. By the Taylor expansion of the logarithm,

E[ln(c_k) − ln(q̃_k)] = Σ_{v=1}^∞ (1/v) E[(1 − q̃_k)^v − (1 − c_k)^v] = (1/2) Var(q̃_k) + R_k,

where

R_k = Σ_{v=3}^∞ (1/v) E[(1 − q̃_k)^v − (1 − c_k)^v].

Since p̂_k is an unbiased estimator of p_k and it is independent of q̃_k,

E[B̃] − B = Σ_{k=1}^K p_k E[ ln(q_k) − ln(c_k) + ln(c_k) − ln(q̃_k) ]
          = Σ_{k=1}^K p_k [ ln(q_k/c_k) + (1/2) Var(q̃_k) + R_k ].

Since 0 < q̃_k ≤ 1, Jensen's inequality implies that E[(1 − q̃_k)^v] ≥ (E[1 − q̃_k])^v = (1 − c_k)^v, and thus R_k ≥ 0.

Proof of Proposition 3. It suffices to show equation 2.11. Note that y_k ∼ Bin(m, q_k) and c_k := E[q̂*_k] = q_k + e_k, where e_k = (1 − q_k)^m/m. By lemma 1,

E[B̂*_n] − B ≥ Σ_{k=1}^K p_k [ ln(q_k/(q_k + e_k)) + (1/2) Var(q̂*_k) ].

Note that

(1/2) Var(q̂*_k) = [ Var(y_k) + Var(1[y_k = 0]) + 2 Cov(y_k, 1[y_k = 0]) ] / (2m²)
                = q_k(1 − q_k)/(2m) + [ (1 − q_k)^m − (1 − q_k)^{2m} − 2mq_k(1 − q_k)^m ] / (2m²)
                = q_k(1 − q_k)/(2m) + e_k/(2m) − e_k²/2 − q_k e_k,


and for large enough m, by the Taylor expansion of the logarithm,

ln(q_k/(q_k + e_k)) = −ln(1 + e_k/q_k) = −e_k/q_k + r_k,

where the remainder term r_k ≥ 0 by properties of alternating series. This means that for large enough m,

E[B̂*_n] − B ≥ Σ_{k=1}^K p_k [ q_k(1 − q_k)/(2m) + e_k/(2m) − e_k²/2 − e_k(q_k + 1/q_k) ].

We have

e_k/(2m) − e_k²/2 = (e_k/(2m)) [1 − (1 − q_k)^m] ≥ 0,

and for large enough m,

q_k(1 − q_k)/(2m) − e_k(q_k + 1/q_k) = e_k(q_k + 1/q_k) [ q_k(1 − q_k)/(4(q_k + 1/q_k) m e_k) − 1 ] + q_k(1 − q_k)/(4m) ≥ q_k(1 − q_k)/(4m),

where the final inequality follows by the fact that lim_{m→∞} m e_k = 0. Thus, for large enough m,

E[B̂*_n] − B ≥ (1/(4m)) Σ_{k=1}^K p_k q_k(1 − q_k) = O(1/m) = O(1/n),

which completes the proof.

3 A New Estimator

In this section, we derive the estimator of D given in equation 1.3. We show that it is consistent and asymptotically normal and that its bias decays exponentially fast.

Recall that D(P‖Q) = A + B. We will estimate A and B separately. Since A is the negative of the entropy of P, we can estimate it by the negative of an estimator of entropy. Specifically, we take the negative of the estimator presented in Zhang (2012), which has an exponentially decaying


bias. This leads to the estimator

Â'_n = −Σ_{v=1}^{n−1} (1/v) Z_{1,v},   (3.1)

where

Z_{1,v} = (n^{1+v} [n − (1 + v)]! / n!) Σ_{k=1}^K [ p̂_k Π_{j=0}^{v−1} (1 − p̂_k − j/n) ].   (3.2)

Let B_{1,n} := E(Â'_n) − A. Zhang (2012) showed that

0 ≤ B_{1,n} = Σ_{v=n}^∞ (1/v) Σ_{k=1}^K p_k (1 − p_k)^v ≤ O(n^{−1}(1 − p_∧)^n),   (3.3)

where p_∧ = min{p_1, . . . , p_K}. Thus, the bias decays exponentially fast.

Remark 2. The product term in equation 3.2 assumes a value of 0 when v ≥ n − x_k + 1. This is because, if v ≥ n − x_k + 1, then when j = n − x_k, we have (1 − p̂_k − j/n) = 0. Combining this observation with some basic algebra gives

Â'_n = −Σ_{k=1}^K p̂_k Σ_{v=1}^{n−x_k} (1/v) Π_{j=1}^v (1 − (x_k − 1)/(n − j)).   (3.4)
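A direct transcription of equation 3.4 as code; `counts` is a hypothetical dictionary of the observed counts x_k (only observed letters, so each x_k ≥ 1) and n is the sample size.

```python
# The entropy part A'_n of equation 3.4, written from the observed counts.
def A_hat_prime(counts, n):
    total = 0.0
    for x_k in counts.values():
        p_hat = x_k / n
        inner, prod = 0.0, 1.0
        for v in range(1, n - x_k + 1):
            prod *= 1.0 - (x_k - 1) / (n - v)   # multiplies in the factor j = v
            inner += prod / v
        total += p_hat * inner
    return -total
```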

We now develop an estimator of B. By the Taylor expansion of the logarithm,

B = Σ_{v=1}^m (1/v) Σ_{k=1}^K p_k (1 − q_k)^v + Σ_{v=m+1}^∞ (1/v) Σ_{k=1}^K p_k (1 − q_k)^v =: η_m + B_{2,m}.   (3.5)

First, for v = 1, . . . , m, we derive an unbiased estimator of (1 − q_k)^v. For each fixed k, k = 1, . . . , K, and each fixed v, v = 1, . . . , m, consider a size-v subsample of the size-m sample from Q. Let {y°_1, . . . , y°_K} and {q̂°_1, . . . , q̂°_K} be the counts and the relative frequencies of the subsample. Note that E(1[y°_k = 0]) = (1 − q_k)^v; thus, 1[y°_k = 0] is an unbiased estimator of (1 − q_k)^v. Since there are C(m, v) different subsamples of size v ≤ m from the sample of size m, the U-statistic

U_{k,v} = C(m, v)^{−1} Σ 1[y°_k = 0],   (3.6)


where the sum is taken over all possible subsamples of size v ≤ m, is also an unbiased estimator of (1 − q_k)^v. Following the argument of Zhang and Zhou (2010), we note that Σ 1[y°_k = 0] is simply counting the number of subsamples of size v in which the letter ℓ_k is missing. This count can be summarized as follows:

1. If v ≤ m − y_k, the count is

   C(m − y_k, v) = (m^v/v!) Π_{j=0}^{v−1} (1 − q̂_k − j/m).   (3.7)

2. If v ≥ m − y_k + 1, the count is zero.

Since, in the second case, Π_{j=0}^{v−1} (1 − q̂_k − j/m) = 0, in both cases we have

U_{k,v} = C(m, v)^{−1} (m^v/v!) Π_{j=0}^{v−1} (1 − q̂_k − j/m) = Π_{j=1}^v (1 − y_k/(m − j + 1)).   (3.8)
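The following sketch checks numerically that the binomial-coefficient form of equations 3.6 and 3.7 and the product form of equation 3.8 give the same U_{k,v}; the values of m, y_k, and the range of v are arbitrary illustrative choices.

```python
# U_{k,v} computed from the binomial-coefficient form and from the product
# form; the two agree, including the zero value when v > m - y_k.
from math import comb

def U_binomial(m, y_k, v):
    return comb(m - y_k, v) / comb(m, v)     # comb returns 0 when v > m - y_k

def U_product(m, y_k, v):
    prod = 1.0
    for j in range(1, v + 1):
        prod *= 1.0 - y_k / (m - j + 1)
    return prod

m, y_k = 20, 6
for v in range(1, m - y_k + 3):
    assert abs(U_binomial(m, y_k, v) - U_product(m, y_k, v)) < 1e-12
```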

Note that p̂_k is an unbiased estimator of p_k and it is independent of U_{k,v}, which, when 1 ≤ v ≤ m, is an unbiased estimator of (1 − q_k)^v. Thus, for such v, an unbiased estimator of Σ_{k=1}^K p_k (1 − q_k)^v is Σ_{k=1}^K p̂_k U_{k,v}. This leads to an estimator of B given by

B̂'_n = Σ_{v=1}^m (1/v) Σ_{k=1}^K p̂_k U_{k,v}.   (3.9)

By construction E(B̂'_n) = η_m, and by equation 3.5, the bias B_{2,m} = B − E(B̂'_n) satisfies

0 ≤ B_{2,m} ≤ Σ_{v=m+1}^∞ (1/(m + 1)) (1 − q_∧)^v = (1 − q_∧)^{m+1} / ((m + 1) q_∧),   (3.10)

where q_∧ = min{q_1, . . . , q_K}. Thus, the bias of B̂'_n decays exponentially in m, and hence in n. Combining the fact that the product term in equation 3.8 assumes a value of 0 for any v ≥ m − y_k + 1 and some basic algebra shows that we can write

B̂'_n = Σ_{k=1}^K p̂_k Σ_{v=1}^{m−y_k} (1/v) Π_{j=1}^v (1 − y_k/(m − j + 1)).   (3.11)
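Equation 3.11, together with the combined estimator D̂'_n = Â'_n + B̂'_n of equation 3.12 below, can be coded in the same way as equation 3.4; this sketch assumes the hypothetical A_hat_prime function from the earlier block.

```python
# B'_n of equation 3.11 and the combined estimator D'_n = A'_n + B'_n.
from collections import Counter

def B_hat_prime(counts_p, n, counts_q, m):
    total = 0.0
    for letter, x_k in counts_p.items():
        p_hat = x_k / n
        y_k = counts_q.get(letter, 0)
        inner, prod = 0.0, 1.0
        for v in range(1, m - y_k + 1):
            prod *= 1.0 - y_k / (m - v + 1)   # multiplies in the factor j = v
            inner += prod / v
        total += p_hat * inner
    return total

def D_hat_prime(sample_p, sample_q):
    n, m = len(sample_p), len(sample_q)
    cp, cq = Counter(sample_p), Counter(sample_q)
    return A_hat_prime(cp, n) + B_hat_prime(cp, n, cq, m)
```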


Remark 3. When constructing our estimator U_{k,v}, we needed to count the number of subsamples that do not have an observation of the letter ℓ_k. When looking at the estimator B̂'_n in the form given in equation 3.11, this is reflected in the fact that the product term equals 1 when y_k = 0. This seems to imply that we need to know the value of K a priori. However, in the estimator, the entire product term is multiplied by p̂_k. Thus, we do not need to find the value of the product if the letter ℓ_k has not appeared in the sample X_1, . . . , X_n. This means that we do not actually need to know the value of K.

The proposed estimator of D(P‖Q) is given by

D̂'_n = Â'_n + B̂'_n.   (3.12)

Combining equations 3.4 and 3.11, it can be shown that this estimator is as given in equation 1.3. The bias of D̂'_n is

E(D̂'_n − D) = B_{1,n} − B_{2,m} = Σ_{k=1}^K p_k [ Σ_{v=n}^∞ (1/v)(1 − p_k)^v − Σ_{v=m+1}^∞ (1/v)(1 − q_k)^v ].   (3.13)

Since each part of the bias decays exponentially fast, the bias decays at least exponentially as well. Note that when the distributions and the sample sizes are similar, the two parts tend to cancel and the bias tends to be smaller. We will now show that the estimator D̂'_n is asymptotically normal.

Theorem 1. Provided that P ≠ Q,

√n (D̂'_n − D) [g^τ(v) Σ(v) g(v)]^{−1/2} →^L N(0, 1).

Before giving the proof, we state the following corollary. Its proof is similar to that of corollary 1.

Corollary 3. Provided that P ≠ Q,

√n (D̂'_n − D) [g^τ(v̂*_n) Σ(v̂*_n) g(v̂*_n)]^{−1/2} →^L N(0, 1),

where v̂*_n is as in corollary 2.

To prove theorem 1, we need the following result:


Lemma 2. If 0 < a ≤ b, then for any integer m ≥ 0, b^m − a^m ≤ m b^{m−1}(b − a).

Proof. Noting that f(x) = x^m is convex on the interval (0, ∞) and that f'(b) = m b^{m−1}, the result follows immediately.

Proof of Theorem 1. Since √n(D̂'_n − D) = √n(D̂'_n − D̂*_n) + √n(D̂*_n − D), by proposition 2 and Slutsky's theorem, it suffices to show that √n(D̂'_n − D̂*_n) →^p 0. Note that

√n(D̂'_n − D̂*_n) = √n(Â'_n − Â_n) + √n(B̂'_n − B̂*_n) =: A_1 + A_2.

We will show that A_1 →^p 0 and A_2 →^p 0. For a better presentation, the proof is divided into two parts: part 1 is the proof of A_1 →^p 0 and part 2 is the proof of A_2 →^p 0. We note that a proof of A_1 →^p 0 is given in Zhang (2012). However, we include a new proof, which is simpler and significantly different.

Part 1: We first consider A_1. By the Taylor expansion of the logarithm we have

Â_n = −Σ_{k=1}^K p̂_k Σ_{v=1}^∞ (1/v)(1 − p̂_k)^v

and hence

A_1 = √n Σ_{k=1}^K p̂_k Σ_{v=1}^{n−x_k} (1/v) [ (1 − p̂_k)^v − Π_{j=1}^v (1 − (x_k − 1)/(n − j)) ] + √n Σ_{k=1}^K p̂_k Σ_{v=n−x_k+1}^∞ (1/v)(1 − p̂_k)^v.

Now note that for j = 0, 1, . . . , n − 1 we have

1 − (x_k − 1)/(n − j) ≥ 1 − x_k/n = 1 − p̂_k  if and only if  0 ≤ j ≤ n/(x_k + 1[x_k = 0]) =: 1/p̂*_k.   (3.14)


Let J_k = ⌊1/p̂*_k⌋ and note that J_k ≥ 1. For any integer v ≥ 1, we have

| Π_{j=1}^v (1 − (x_k − 1)/(n − j)) − (1 − p̂_k)^v |
  ≤ | Π_{j=1}^v (1 − (x_k − 1)/(n − j)) − (1 − p̂_k)^{0∨(v−J_k)} Π_{j=1}^{J_k∧v} (1 − (x_k − 1)/(n − j)) |
    + | (1 − p̂_k)^{0∨(v−J_k)} Π_{j=1}^{J_k∧v} (1 − (x_k − 1)/(n − j)) − (1 − p̂_k)^v |
  = [ (1 − p̂_k)^{0∨(v−J_k)} Π_{j=1}^{J_k∧v} (1 − (x_k − 1)/(n − j)) − Π_{j=1}^v (1 − (x_k − 1)/(n − j)) ]
    + [ (1 − p̂_k)^{0∨(v−J_k)} Π_{j=1}^{J_k∧v} (1 − (x_k − 1)/(n − j)) − (1 − p̂_k)^v ].

Thus,

|A_1| ≤ √n Σ_{k=1}^K p̂_k Σ_{v=1}^{n−x_k} (1/v) [ 2(1 − p̂_k)^{0∨(v−J_k)} Π_{j=1}^{J_k∧v} (1 − (x_k − 1)/(n − j)) − Π_{j=1}^v (1 − (x_k − 1)/(n − j)) − (1 − p̂_k)^v ]
        + √n Σ_{k=1}^K p̂_k Σ_{v=n−x_k+1}^∞ (1/v)(1 − p̂_k)^v
      = 2√n Σ_{k=1}^K p̂_k Σ_{v=1}^{n−x_k} (1/v) [ (1 − p̂_k)^{0∨(v−J_k)} Π_{j=1}^{J_k∧v} (1 − (x_k − 1)/(n − j)) − (1 − p̂_k)^v ]
        + √n (Â'_n − A) + √n (A − Â_n)
      =: 2A_{1,1} + A_{1,2} + A_{1,3}.

Clearly, it suffices to show that E(A_{1,i}) → 0 for i = 1, 2, 3. We have E(A_{1,2}) → 0 by equation 3.3 and E(A_{1,3}) → 0 by equation 2.10. Now observe that


A_{1,1} ≥ 0 and

A_{1,1} ≤ √n Σ_{k=1}^K p̂_k Σ_{v=1}^{n−x_k} (1/v) [ Π_{j=1}^{J_k∧v} (1 − (x_k − 1)/(n − j)) − (1 − p̂_k)^{J_k∧v} ]
        ≤ √n Σ_{k=1}^K p̂_k Σ_{v=1}^{n−x_k} (1/v) [ (1 − (x_k − 1)/(n − 1))^{J_k∧v} − (1 − p̂_k)^{J_k∧v} ]
        ≤ √n Σ_{k=1}^K p̂_k Σ_{v=1}^{n−x_k} (1/v) (J_k ∧ v) (1 − (x_k − 1)/(n − 1))^{(J_k∧v)−1} (n − x_k)/(n(n − 1))
        ≤ (√n/(n − 1)) Σ_{k=1}^K p̂_k J_k Σ_{v=1}^n (1/v)
        ≤ (√n/(n − 1)) Σ_{k=1}^K (p̂_k/p̂*_k) [ 1 + ∫_1^n (1/x) dx ]
        = (√n/(n − 1)) [1 + ln n] Σ_{k=1}^K x_k/(x_k + 1[x_k = 0])
        ≤ K (√n/(n − 1)) [1 + ln n],

where the third line follows by lemma 2 and the fact that 1 − (x_k − 1)/(n − 1) ≥ 1 − p̂_k.

⎡ ⎤ m−y K n  √ ⎣ k 1 U + ln qˆ∗k ⎦ . pˆ k m m v k,v k=1

v=1

p

Since pˆ k → pk for every k = 1, . . . , K and n/m → λ, by Slutsky’s theorem, it √ m−y p suffices to show that m[ v=1 k v1 Uk,v + ln(qˆ∗k )] → 0 for each k = 1, 2, . . . , K. By the Taylor expansion of the logarithm, √



⎤ ⎡ ⎤ m−yk ∞  1  1    √ 1 v 1 − qˆ∗k ⎦ m⎣ U + ln(qˆ∗k )⎦ = m ⎣ U − v k,v v k,v v m−yk v=1

v=1

v=1

⎧ ⎫ m−y ⎬  √ ⎨ k 1  = m Uk,v − (1 − qˆ∗k )v 1[qˆk > 0] ⎩ ⎭ v v=1

¨ Nonparametric Estimation of Kullback-Leibler Divergence

2585

⎧ ⎫ m−y ⎬ √ ⎨k 1 + m [Uk,v − (1 − qˆ∗k )v ] 1[qˆk = 0] ⎩ ⎭ v v=1

√ − m

∞  v=m−yk +1

1 (1 − qˆ∗k )v v

=: A2,1 + A2,2 + A2,3 . p

First, we show that A2,2 → 0. When qˆk = 0, we have Uk,v = 1, qˆ∗k = 1/m, and thus 0 ≤ A2,2

    m √  1 1 v 1− 1− 1[qˆk = 0] ≤ m3/2 1[qˆk = 0]. = m v m v=1

p

Hence, 0 ≤ E(A2,2 ) ≤ m3/2 (1 − qk )m → 0, and therefore A2,2 → 0. Now note that A2,3 ≤ 0 and ⎧ ⎫ m−y  v  ⎬ √ k 1 ⎨ yk A2,1 = m 1− − (1 − qˆ∗k )v 1[qˆk > 0] ⎭ v⎩ m− j+1 v=1

j=1

m−y



v  √ k 1  1 − qˆk − (1 − qˆ∗k )v 1[qˆk > 0] = 0, m v v=1

since qˆk = qˆ∗k when qˆk > 0. To complete the proof, we will show that E(A2,1 + A2,3 ) → 0. We have A2,1 + A2,2 + A2,3 =



⎤ m  1    1 v U − 1 − qk ⎦ m⎣ v k,v v ⎡

m−yk v=1

v=1

∞  ∞ 1  v √ 1 ∗ v (1 − qˆk ) − 1 − qk − m v v v=1

v=1

∞ v √  1 1 − qk − m v v=m+1

=: A2,1,1 − A2,1,2 − A2,1,3 . m−yk 1 m 1 E(A2,1,1 ) = 0 because v=1 v v=1 v Uk,v is an unbiased estimator of (1 − q )v by construction. By the Taylor expansion of the logarithm A2,1,2 =  √  k m ln(qk ) − ln(qˆ∗k ) and E(A2,1,2 ) → 0 by an argument similar to the proof


of proposition 3. Finally, E(A_{2,1,3}) → 0 because

0 ≤ A_{2,1,3} ≤ √m Σ_{v=m}^∞ (1/v)(1 − q_k)^v ≤ (1 − q_k)^m/(√m q_k) → 0.

Hence E(A_{2,1} + A_{2,2} + A_{2,3}) → 0, but since E(A_{2,2}) → 0 is already established, we have E(A_{2,1} + A_{2,3}) → 0, which implies A_{2,1} + A_{2,3} →^p 0. From here, the result of the theorem follows.

4 Symmetrized Kullback-Leibler Divergence

One of the limitations of Kullback-Leibler divergence is that it is not symmetric. This can make its use in certain applications appear artificial. For this reason, it is useful to consider a symmetric variant of Kullback-Leibler divergence. This was recognized already in Kullback and Leibler (1951). (See Veldhuis, 2002, for more recent applications.) The symmetrized Kullback-Leibler divergence is defined to be

S = S(P, Q) = (1/2)[D(P‖Q) + D(Q‖P)]
  = (1/2)[ Σ_{k=1}^K p_k ln(p_k) − Σ_{k=1}^K p_k ln(q_k) ] + (1/2)[ Σ_{k=1}^K q_k ln(q_k) − Σ_{k=1}^K q_k ln(p_k) ].   (4.1)

We can define the plug-in estimator of S(P, Q) by

Ŝ_n = Ŝ_n(P, Q) = (1/2)[ D̂_n(P‖Q) + D̂_m(Q‖P) ]
    = (1/2)[ Σ_{k=1}^K p̂_k ln(p̂_k) − Σ_{k=1}^K p̂_k ln(q̂_k) ] + (1/2)[ Σ_{k=1}^K q̂_k ln(q̂_k) − Σ_{k=1}^K q̂_k ln(p̂_k) ],   (4.2)

and the augmented plug-in estimator of S(P, Q) by

Ŝ*_n = Ŝ*_n(P, Q) = (1/2)[ D̂*_n(P‖Q) + D̂*_m(Q‖P) ]
     = (1/2)[ Σ_{k=1}^K p̂_k ln(p̂_k) − Σ_{k=1}^K p̂_k ln(q̂*_k) ] + (1/2)[ Σ_{k=1}^K q̂_k ln(q̂_k) − Σ_{k=1}^K q̂_k ln(p̂*_k) ].   (4.3)


Here q̂*_k is as in equation 2.5 and p̂*_k = p̂_k + 1[x_k = 0]/n for k = 1, . . . , K. Let G_s(v) be a function such that

2G_s(v) = Σ_{k=1}^{K−1} p_k (ln p_k − ln q_k) + (1 − Σ_{k=1}^{K−1} p_k) [ ln(1 − Σ_{k=1}^{K−1} p_k) − ln(1 − Σ_{k=1}^{K−1} q_k) ]
        + Σ_{k=1}^{K−1} q_k (ln q_k − ln p_k) + (1 − Σ_{k=1}^{K−1} q_k) [ ln(1 − Σ_{k=1}^{K−1} q_k) − ln(1 − Σ_{k=1}^{K−1} p_k) ],

and let g_s(v) = ∇G_s(v). For each j, j = 1, . . . , K − 1, we have

∂G_s(v)/∂p_j = (1/2) [ ln(p_j/q_j) − ln(p_K/q_K) ] − (1/2) [ q_j/p_j − q_K/p_K ]

and

∂G_s(v)/∂q_j = (1/2) [ ln(q_j/p_j) − ln(q_K/p_K) ] − (1/2) [ p_j/q_j − p_K/q_K ].

By arguments similar to the proofs of propositions 1 and 2 and corollaries 1 and 2, we get proposition 4 and corollary 4:

Proposition 4. Provided that P ≠ Q,

√n (Ŝ_n − S) [g_s^τ(v) Σ(v) g_s(v)]^{−1/2} →^L N(0, 1)  and  √n (Ŝ*_n − S) [g_s^τ(v) Σ(v) g_s(v)]^{−1/2} →^L N(0, 1).

Corollary 4. Provided that P ≠ Q,

√n (Ŝ_n − S) [g_s^τ(v̂_n) Σ(v̂_n) g_s(v̂_n)]^{−1/2} →^L N(0, 1)  and  √n (Ŝ*_n − S) [g_s^τ(û*) Σ(û*) g_s(û*)]^{−1/2} →^L N(0, 1),

where û* = (p̂*_1, . . . , p̂*_{K−1}, q̂*_1, . . . , q̂*_{K−1})^τ.
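Each symmetrized estimator in this section is simply the average of the two directed estimators. A minimal sketch, assuming the hypothetical D_hat_prime (and, for the plug-in versions, plugin_kl or augmented_plugin_kl) from the earlier blocks:

```python
# Symmetrization by averaging the two directions, as in equations 4.2-4.4.
def S_hat_prime(sample_p, sample_q):
    return 0.5 * (D_hat_prime(sample_p, sample_q) + D_hat_prime(sample_q, sample_p))
```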


Example 2. Continuing from example 1, where P and Q both have Bernoulli distributions with probabilities of success p and q, we have

√n (Ŝ_n − S) →^L N(0, σ_s²)  and  √n (Ŝ*_n − S) →^L N(0, σ_s²),

where

σ_s² = (p(1 − p)/4) [ ln(p(1 − q)/(q(1 − p))) + (p − q)/(p(1 − p)) ]² + (λq(1 − q)/4) [ ln(q(1 − p)/(p(1 − q))) + (q − p)/(q(1 − q)) ]².

Remark 4. Since D̂_n(P‖Q) has an infinite bias as an estimator of D(P‖Q) and D̂_m(Q‖P) has an infinite bias as an estimator of D(Q‖P), it follows that Ŝ_n has an infinite bias as an estimator of S(P, Q). Similarly, from proposition 3, it follows that Ŝ*_n has a bias that decays no faster than O(1/n).

By modifying the estimator given in equation 1.3, we can get an estimator of S(P, Q) whose bias decays exponentially fast. Let

Ŝ'_n = Ŝ'_n(P, Q) = (1/2)[ D̂'_n(P‖Q) + D̂'_m(Q‖P) ]
     = (1/2) Σ_{k=1}^K p̂_k [ Σ_{v=1}^{m−y_k} (1/v) Π_{j=1}^v (1 − y_k/(m − j + 1)) − Σ_{v=1}^{n−x_k} (1/v) Π_{j=1}^v (1 − (x_k − 1)/(n − j)) ]
     + (1/2) Σ_{k=1}^K q̂_k [ Σ_{v=1}^{n−x_k} (1/v) Π_{j=1}^v (1 − x_k/(n − j + 1)) − Σ_{v=1}^{m−y_k} (1/v) Π_{j=1}^v (1 − (y_k − 1)/(m − j)) ].   (4.4)

By equation 3.13, the bias of this estimator is

E(Ŝ'_n − S) = (1/2) Σ_{k=1}^K p_k [ Σ_{v=n}^∞ (1/v)(1 − p_k)^v − Σ_{v=m+1}^∞ (1/v)(1 − q_k)^v ]
            + (1/2) Σ_{k=1}^K q_k [ Σ_{v=m}^∞ (1/v)(1 − q_k)^v − Σ_{v=n+1}^∞ (1/v)(1 − p_k)^v ],

which decays at least exponentially fast as n, m → ∞ since each part does.

Theorem 2. Provided that P ≠ Q,

√n (Ŝ'_n − S) [g_s^τ(v) Σ(v) g_s(v)]^{−1/2} →^L N(0, 1).


Table 1: Kullback-Leibler Divergence.

Divergence    True Value
D(P‖Q)        .09513
D(Q‖P)        .14452
S(P, Q)       .11983

Proof. By proposition 4 and Slutsky's theorem, it suffices to show that √n(Ŝ'_n − Ŝ*_n) →^p 0. This follows from the facts that √n(D̂'_n(P‖Q) − D̂*_n(P‖Q)) →^p 0 and √n(D̂'_m(Q‖P) − D̂*_m(Q‖P)) →^p 0, which follow from the proof of theorem 1.

By arguments similar to the proof of corollary 1, we have the following:

Corollary 5. Provided that P ≠ Q,

√n (Ŝ'_n − S) [g_s^τ(û*) Σ(û*) g_s(û*)]^{−1/2} →^L N(0, 1),

where û* is as in corollary 4.

5 Simulation Results

In this section we present a small-scale simulation study in order to verify the results obtained in the letter. For these simulations, we consider the distributions P = {p_k : k = 0, 1, 2, . . . , 99}, where

p_k = (100 − k)/5050,   k = 0, 1, 2, . . . , 99,

and Q = {q_k : k = 0, 1, 2, . . . , 99}, where

q_k = (100 − 0.5k)/7525,   k = 0, 1, 2, . . . , 99.
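The following sketch builds these two distributions and evaluates their divergences directly from definition 1.1 (using the hypothetical kl_divergence helper from the first code block); the results can be compared with Table 1.

```python
# The simulation distributions of section 5 and their divergences.
P = [(100 - k) / 5050.0 for k in range(100)]
Q = [(100 - 0.5 * k) / 7525.0 for k in range(100)]
assert abs(sum(P) - 1.0) < 1e-12 and abs(sum(Q) - 1.0) < 1e-12

d_pq = kl_divergence(P, Q)
d_qp = kl_divergence(Q, P)
print(round(d_pq, 5), round(d_qp, 5), round(0.5 * (d_pq + d_qp), 5))
```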

Table 1 gives the values of the Kullback-Leibler divergence and symmetrized Kullback-Leibler divergence for these distributions. Note that, as expected, D(P‖Q) ≠ D(Q‖P).

The simulations were performed as follows. For a fixed sample size n, we simulated n observations from distribution P and n observations from distribution Q. We then estimated D(P‖Q) using both the augmented plug-in estimator D̂*_n and the proposed estimator D̂'_n. We did not consider the plug-in estimator D̂_n since the only cases where it differs from D̂*_n are those where it takes the value of infinity.


Figure 1: Bias of the estimators and effectiveness of the 95% confidence intervals. The solid line corresponds to the estimator D̂'_n (or Ŝ'_n in panels 3a and 3b) and the dashed line corresponds to the estimator D̂*_n (or Ŝ*_n in panels 3a and 3b).

To estimate the bias, we found the difference between the true value and the estimated value for both estimators. We then repeated this 5000 times for the same sample size and averaged the differences together. This was repeated for all sample sizes ranging from 10 to 1000 in increments of 10. The results are given in Figure 1, panel 1a. In the plot, the dashed line represents the absolute value of the bias of D̂*_n and the solid line represents the absolute value of the bias of D̂'_n. We see that D̂'_n has a significantly smaller bias than D̂*_n. In fact, D̂'_n has a very small bias even for small sample sizes.

In a similar way, we estimated D(Q‖P). Plots of the absolute value of the bias in this case are given in Figure 1, panel 2a. In this case, D̂*_n is competitive for smaller sample sizes, but has a significantly larger bias than


D̂'_n for larger sample sizes. We also estimated the symmetrized Kullback-Leibler divergence S(P, Q) by Ŝ'_n and Ŝ*_n. The plots of the biases in this case are given in Figure 1, panel 3a. As we might expect, the biases are in between those of the other two cases. In all three cases, we see that the biases of D̂'_n and Ŝ'_n are small and, especially for larger sample sizes, significantly smaller than the biases of D̂*_n and Ŝ*_n.

One of the main applications of the asymptotic normality results given in this letter is the ability to construct confidence intervals. Specifically, in light of corollary 3, an asymptotic 95% confidence interval for D(P‖Q) is given by

( D̂'_n − 1.96 σ̂/√n, D̂'_n + 1.96 σ̂/√n ),   (5.1)

where

σ̂ = [ g^τ(v̂*_n) Σ(v̂*_n) g(v̂*_n) ]^{1/2}.

A similar asymptotic confidence interval can be constructed based on D̂*_n by using corollary 2. Likewise, we can construct asymptotic confidence intervals for S(P, Q) based on Ŝ'_n using corollary 5 or based on Ŝ*_n using corollary 4. We will now use simulation to see how well these asymptotic confidence intervals work for finite sample sizes.

The simulations are performed as before. First, for a fixed sample size n, we get our two random samples and estimate D(P‖Q) by D̂'_n and D̂*_n. Then we use equation 5.1 to construct asymptotic 95% confidence intervals based on each of these estimators. We repeat this 5000 times and see what proportion of the time the true D(P‖Q) is in the interval. Plots of the proportions for different sample sizes are given in Figure 1, panel 1b. The solid line represents the proportion of the time that the true value is in the confidence interval based on D̂'_n and the dashed line represents the proportion of the time that the true value is in the confidence interval based on D̂*_n. Ideally both lines should be close to .95.

We see that the confidence interval based on D̂*_n performs very poorly except for small sample sizes. This appears to be caused by the sizable bias of this estimator. Once the sample size is relatively large, the radius of the confidence interval becomes very small, and because it is smaller than the bias, we cannot capture the true value. On the other hand, the confidence interval based on D̂'_n works very well and gets close to .95 fairly quickly.
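A sketch of the interval in equation 5.1 with the plug-in variance estimate of corollary 3; it uses numpy, assumes every letter appears in both samples (so that no proportion is zero), and the function name and argument layout are illustrative.

```python
# Asymptotic confidence interval of equation 5.1 (corollary 3).  p_hat and
# q_star are the full K-vectors of proportions (q_star augmented as in
# equation 2.5), lam is n/m, and d_hat is a point estimate such as D'_n.
import numpy as np

def kl_confidence_interval(d_hat, p_hat, q_star, n, lam, z=1.96):
    p = np.asarray(p_hat, dtype=float)
    q = np.asarray(q_star, dtype=float)
    K = len(p)
    # gradient g(v) over the first K - 1 coordinates of p and then of q
    g_p = np.log(p[:-1] / q[:-1]) - np.log(p[-1] / q[-1])
    g_q = -p[:-1] / q[:-1] + p[-1] / q[-1]
    g = np.concatenate([g_p, g_q])
    # block-diagonal covariance Sigma(v) of equation 2.2
    S1 = np.diag(p[:-1]) - np.outer(p[:-1], p[:-1])
    S2 = lam * (np.diag(q[:-1]) - np.outer(q[:-1], q[:-1]))
    zero = np.zeros((K - 1, K - 1))
    Sigma = np.block([[S1, zero], [zero, S2]])
    sigma_hat = float(np.sqrt(g @ Sigma @ g))
    half = z * sigma_hat / np.sqrt(n)
    return d_hat - half, d_hat + half
```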


In a similar way, we get confidence intervals for D(Q‖P) based on D̂*_n and D̂'_n. Plots of the proportion of the time that the confidence interval captures the true value are given in Figure 1, panel 2b. Similar results for S(P, Q) are presented in Figure 1, panel 3b. In both cases, we see that the confidence intervals based on D̂'_n and Ŝ'_n do very well, while those based on D̂*_n and Ŝ*_n are not useful for large sample sizes.

Acknowledgments

We thank Lijuan Cao for help in performing the simulations and the anonymous referee whose comments helped to improve the presentation of this letter.

References

Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Hoboken, NJ: Wiley.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86.
Marcon, E., Hérault, B., Baraloto, C., & Lang, G. (2012). The decomposition of Shannon's entropy and a confidence interval for beta diversity. Oikos, 121(4), 516–522.
Miller, G. (1955). Note on the bias of information estimates. In H. Quastler (Ed.), Information theory in psychology: Problems and methods (pp. 95–100). Glencoe, IL: Free Press.
Mori, T., Nishikimi, K., & Smith, T. E. (2005). A divergence statistic for industrial localization. Review of Economics and Statistics, 87(4), 635–651.
Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(10), 5847–5861.
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15(6), 1191–1253.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Tanabe, K., & Sagae, M. (1992). An exact Cholesky decomposition and the generalized inverse of the variance-covariance matrix of the multinomial distribution, with applications. Journal of the Royal Statistical Society, Series B (Methodological), 54(1), 211–219.
Veldhuis, R. (2002). The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Processing Letters, 9(3), 96–99.
Wang, Q., Kulkarni, S. R., & Verdú, S. (2009). Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5), 2392–2405.
Zhang, Z. (2012). Entropy estimation in Turing's perspective. Neural Computation, 24(5), 1368–1389.
Zhang, Z. (2013). Asymptotic normality of an entropy estimator with exponentially decaying bias. IEEE Transactions on Information Theory, 59(1), 504–508.


Zhang, Z., & Grabchak, M. (2013). Bias adjustment for a nonparametric entropy estimator. Entropy, 15(6), 1999–2011.
Zhang, Z., & Zhang, X. (2012). A normal law for the plug-in estimator of entropy. IEEE Transactions on Information Theory, 58(5), 2745–2747.
Zhang, Z., & Zhou, J. (2010). Re-parameterization of multinomial distribution and diversity indices. Journal of Statistical Planning and Inference, 140(7), 1731–1738.

Received December 13, 2013; accepted April 25, 2014.

367KB Sizes 2 Downloads 4 Views