
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 23, NO. 12, DECEMBER 2012

Generalization Bounds of ERM-Based Learning Processes for Continuous-Time Markov Chains

Chao Zhang and Dacheng Tao, Senior Member, IEEE

Abstract— Many existing results in statistical learning theory are based on the assumption that samples are independently and identically distributed (i.i.d.). However, the i.i.d. assumption is not suitable for practical problems in which samples are time dependent. In this paper, we are mainly concerned with the empirical risk minimization (ERM) based learning process for time-dependent samples drawn from a continuous-time Markov chain. This learning process covers many kinds of practical applications, e.g., the prediction for a time series and the estimation of channel state information. Thus, it is significant to study its theoretical properties, including the generalization bound, the asymptotic convergence, and the rate of convergence. It is noteworthy that, since samples are time dependent in this learning process, the concerns of this paper cannot (at least not straightforwardly) be addressed by existing methods developed under the i.i.d. sample assumption. We first develop a deviation inequality for a sequence of time-dependent samples drawn from a continuous-time Markov chain and present a symmetrization inequality for such a sequence. By using the resultant deviation and symmetrization inequalities, we then obtain generalization bounds of the ERM-based learning process for time-dependent samples drawn from a continuous-time Markov chain. Finally, based on the resultant generalization bounds, we analyze the asymptotic convergence and the rate of convergence of the learning process.

Index Terms— Convergence, deviation inequality, empirical risk minimization, generalization bound, Markov chain, rate of convergence, statistical learning theory.

Manuscript received February 14, 2012; revised August 31, 2012; accepted September 5, 2012. Date of publication October 9, 2012; date of current version November 20, 2012. This work was supported in part by the Australian Research Council Discovery Project under DP-120103730.

C. Zhang was with the School of Computer Engineering, Nanyang Technological University, 639798 Singapore. He is now with the Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University, Tempe, AZ 85287 USA (e-mail: [email protected]).

D. Tao is with the Centre for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney 2007, Australia (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNNLS.2012.2217987

I. INTRODUCTION

One of the major concerns in statistical learning theory is obtaining generalization bounds (also called risk bounds or error bounds) of empirical risk minimization (ERM) based learning processes for independently and identically distributed (i.i.d.) samples, which measure the probability that a function, chosen from a function class by an ERM-based algorithm, has a sufficiently small error (see [1], [2]). Generalization bounds have been widely used to study the consistency of ERM-based learning processes for i.i.d. samples [3], the asymptotic convergence of i.i.d. empirical processes [4], and the learnability of learning models [5]–[7]. Generally, there are three essential ingredients for obtaining generalization bounds of a specific learning process: complexity measures of function classes, deviation (or concentration) inequalities, and symmetrization inequalities related to the learning process.

In the scenario of i.i.d. samples, many generalization bounds have been obtained by using deviation (or concentration) inequalities and symmetrization inequalities for a sequence of i.i.d. samples. For example, van der Vaart and Wellner [4] presented generalization bounds based on the Rademacher complexity and the covering number. Vapnik [3] gave generalization bounds based on the Vapnik–Chervonenkis (VC) dimension. Bartlett et al. [8] proposed the local Rademacher complexity and obtained a sharp generalization bound for the particular function class {f ∈ F : E f² < βE f, β > 0}. Hussain and Shawe-Taylor [9] showed improved loss bounds for multiple kernel learning.

Samples are, however, not always i.i.d. in practice; e.g., some financial, signal, and physical behaviors are temporally dependent. Thus, the aforementioned results in statistical learning theory are not applicable (or at least cannot be straightforwardly applied) to the scenario of non-i.i.d. samples. It is noteworthy that the scenario of non-i.i.d. samples contains a wide variety of cases, and it is impossible to find a unified form covering all of them. Thus, one feasible scheme is to find representative stochastic processes, e.g., mixing processes [10], [11], Lévy processes [12], and Markov chains, which cover a large number of useful cases in the non-i.i.d. scenario, and then to study the theoretical properties of the learning process for each representative stochastic process individually.

Recently, generalization bounds have been developed for specific learning processes of non-i.i.d. samples. Mohri and Rostamizadeh [10] obtained generalization bounds based on the Rademacher complexity for ϕ-mixing processes and β-mixing processes. Mixing sequences can be deemed transitions between the i.i.d. scenario and the non-i.i.d. scenario, where the time dependence between samples diminishes with time. In particular, by utilizing the technique of independent blocks [13], samples drawn from a ϕ-mixing (or β-mixing) process can be transformed to the i.i.d. scenario, and then some classical results under the i.i.d. sample assumption can be applied to mixing processes to obtain generalization bounds. Mohri and Rostamizadeh [11] investigated stability bounds for stationary ϕ-mixing and β-mixing processes. Zhang and Tao [12] discussed generalization bounds for Lévy processes without the Gaussian component. Since Lévy



processes are strictly time dependent, the classical results based on the i.i.d. sample assumption cannot be applied to that scenario. Therefore, under some specific assumptions on the function class, Zhang and Tao achieved their results by using a deviation inequality specific to Lévy processes without the Gaussian component, given in [14]. So far, there has been no attempt to study generalization bounds of learning processes for continuous-time Markov chains.

In this paper, we study the theoretical properties of the ERM-based learning process for time-dependent samples drawn from a continuous-time Markov chain (briefly, the learning process for a continuous-time Markov chain), which covers many kinds of practical problems. We first introduce some notation to formalize the proposed approach. Let X ⊂ R^I be an input space and Y ⊂ R^J the corresponding output space. Define Z := X × Y ⊂ R^K with K = I + J, and assume that Z := {z_t}_{t≥0} is an undetermined continuous-time Markov chain with a countable state space Ω ⊂ R^K, where z_t = (x_t, y_t)^T ∈ R^K with x_t ∈ X and y_t ∈ Y for any t ≥ 0. Given a function class G and a time interval [T_1, T_2], we wish to find a function g*: X → Y that, for any input x_t ∈ X (t ∈ [T_1, T_2]), predicts the corresponding output y_t as accurately as possible. A natural criterion for choosing g* is the lowest risk attained by some function in G. Given a loss function ℓ: Y² → R, the target function g* minimizes the expected risk

E(ℓ ∘ g*) := (1/T) ∫_{T_1}^{T_2} ∫ ℓ(g*(x_t), y_t) dP_t dt   (1)

where T = T_2 − T_1 and P_t stands for the distribution of z_t = (x_t, y_t)^T at time t. Since the distributions P_t (t ∈ [T_1, T_2]) are unknown, the target function g* usually cannot be obtained directly by minimizing the expected risk (1). Instead, we apply the ERM inductive principle to handle this problem (see [3]). Given a function class G and a set of time-dependent samples Z_1^N := {z_{t_n}}_{n=1}^N drawn from {z_t}_{t≥0} in the time interval [T_1, T_2] with T_1 < t_1 < ··· < t_N < T_2, we define the empirical risk of g ∈ G as

E_N(ℓ ∘ g) := (1/N) Σ_{n=1}^N ℓ(g(x_{t_n}), y_{t_n})   (2)

which is considered an approximation to the expected risk (1). Let g_N ∈ G be the function that minimizes the empirical risk (2) over G; we then deem g_N an estimate of g* with respect to the sample set Z_1^N. It is worth noting that the ERM-based learning process covers many kinds of practical applications, e.g., time-series prediction [15]–[17] and channel state information estimation [18]–[20]. We take channel state information estimation as an example. In this estimation, x ∈ X ⊂ R^I and y ∈ Y ⊂ R^J are regarded as the transmit and receive vectors, respectively. It is reasonable to suppose that Z is an undetermined continuous-time Markov chain with a countable state space, for the following reasons.

1) The signal behavior is time dependent, and many kinds of signals have the Markov property.


2) The continuous-time Markov chain is one of the representative processes and covers many specific cases.

3) Generally, most devices can only output rational numbers in practice. Therefore, it is reasonable to deem z = (x, y)^T a rational vector, and the cardinality of the set Ω that contains all possible z is accordingly countable.

One of the most frequently used models is y = Hx + n [19], [20], where H and n are the channel matrix and the noise vector, respectively. The corresponding function class G is formalized as G := {x ↦ Hx + n : H ∈ R^{J×I}, n ∈ R^J}. The loss function ℓ is selected as the mean square error function, and the least squares estimate is then used to find the function that minimizes the empirical risk (2). Moreover, other ERM-based methods have also been proposed for the estimation of channel state information [21], [22].

In the aforementioned learning process, we are mainly interested in the following two types of quantities.

1) One is E(ℓ ∘ g_N) − E_N(ℓ ∘ g_N), which corresponds to the estimation of the expected risk from an empirical quantity.

2) The other is E(ℓ ∘ g_N) − E(ℓ ∘ g*), which corresponds to the performance of the algorithm given a learning model.

Recalling (1) and (2), since E_N(ℓ ∘ g*) − E_N(ℓ ∘ g_N) ≥ 0, we have

E(ℓ ∘ g_N) = E(ℓ ∘ g_N) − E(ℓ ∘ g*) + E(ℓ ∘ g*)
           ≤ E_N(ℓ ∘ g*) − E_N(ℓ ∘ g_N) + E(ℓ ∘ g_N) − E(ℓ ∘ g*) + E(ℓ ∘ g*)
           ≤ 2 sup_{g∈G} |E(ℓ ∘ g) − E_N(ℓ ∘ g)| + E(ℓ ∘ g*)

and thus

0 ≤ E(ℓ ∘ g_N) − E(ℓ ∘ g*) ≤ 2 sup_{g∈G} |E(ℓ ∘ g) − E_N(ℓ ∘ g)|.

This shows that the asymptotic behaviors of the aforementioned two quantities, as N goes to infinity, can both be described by the supremum

sup_{g∈G} |E(ℓ ∘ g) − E_N(ℓ ∘ g)|   (3)

which is the generalization bound of the learning process for the sample set Z_1^N drawn from the continuous-time Markov chain {z_t}_{t≥0}. For convenience, we define the loss function class

F := {z ↦ ℓ(g(x), y) : g ∈ G}   (4)

and call F the function class in the rest of this paper. Given a sample set {z_{t_n}}_{n=1}^N drawn from {z_t}_{t≥0}, we briefly denote, for any f ∈ F,

E_t f := ∫ f(z) dP_t,  t > 0   (5)

and, by taking f = ℓ ∘ g, we rewrite (2) as

E_N f := (1/N) Σ_{n=1}^N f(z_{t_n})   (6)


where E_t f stands for the expectation of f taken with respect to z_t, and E_N f stands for the empirical risk of f with respect to the given sample set {z_{t_n}}_{n=1}^N.
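To make the ERM setup above concrete, the following sketch instantiates the empirical risk (2) for the channel model y = Hx + n discussed earlier: with the mean square error loss, minimizing (2) over G = {x ↦ Hx + n} reduces to an ordinary least squares problem. This is only an illustrative sketch; the dimensions, constants, and the AR(1)-style surrogate for time-dependent samples are hypothetical choices of ours, not objects defined in the paper.

```python
import numpy as np

# Illustrative sketch: ERM with the mean square error loss over the affine
# class G = {x -> Hx + n} reduces to least squares. All quantities below
# (dimensions, noise levels, the AR(1) surrogate) are hypothetical.
rng = np.random.default_rng(0)
I, J, N = 4, 3, 200                      # input/output dimensions, sample size

H_true = rng.normal(size=(J, I))         # unknown channel matrix
n_true = rng.normal(size=J)              # unknown noise offset

# Time-dependent inputs: an AR(1)-style sequence stands in for samples
# observed at times t_1 < ... < t_N from a time-dependent process.
X = np.zeros((N, I))
for k in range(1, N):
    X[k] = 0.9 * X[k - 1] + rng.normal(scale=0.5, size=I)
Y = X @ H_true.T + n_true + 0.1 * rng.normal(size=(N, J))

# ERM: minimize the empirical risk (2) over G, i.e., solve a least squares
# problem for the augmented parameter matrix [H | n].
X_aug = np.hstack([X, np.ones((N, 1))])          # append 1 for the offset n
W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)    # W stacks H^T over n^T
H_hat, n_hat = W[:I].T, W[I]

emp_risk = np.mean(np.sum((X_aug @ W - Y) ** 2, axis=1))   # value of (2)
print("empirical risk of g_N:", emp_risk)
print("||H_hat - H_true||_F:", np.linalg.norm(H_hat - H_true))
```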

Moreover, let R_t(z, z′) (z, z′ ∈ Ω) stand for the probability that the state z shifts to z′ at time t. Given an f ∈ F, we then define, for any z ∈ Ω and t > 0,

E_t^{(z)} f := Σ_{z′∈Ω} f(z′) R_t(z, z′)   (7)

which is the conditional expectation of f at time t given the event that the previous state is z. For any time t > 0, we denote by z_{t−} and z_{t+} the original state and the converted state at time t, respectively. Then, by (5) and (7), the law of total expectation leads to

E_t f = Σ_{z∈Ω} E_t^{(z)} f · Pr{z_{t−} = z}   (8)

which implies that there must be two states z′ ≠ z′′ ∈ Ω such that

E_t^{(z′)} f ≤ E_t f ≤ E_t^{(z′′)} f.   (9)

Similarly, (1) implies that there must be two times t_1 ≠ t_2 ∈ [T_1, T_2] such that, for any g ∈ G, we have

∫ ℓ(g(x_{t_1}), y_{t_1}) dP_{t_1} ≤ (1/T) ∫_{T_1}^{T_2} ∫ ℓ(g(x_t), y_t) dP_t dt ≤ ∫ ℓ(g(x_{t_2}), y_{t_2}) dP_{t_2}.   (10)

According to (3)–(7), (9), and (10), we arrive at

sup_{f∈F} |E f − E_N f|
  = sup_{g∈G} |E(ℓ ∘ g) − E_N(ℓ ∘ g)|   (by taking f = ℓ ∘ g)
  = sup_{g∈G} |(1/T) ∫_{T_1}^{T_2} ∫ ℓ(g(x_t), y_t) dP_t dt − (1/N) Σ_{n=1}^N ℓ(g(x_{t_n}), y_{t_n})|
  ≤ sup_{g∈G, t∈[T_1,T_2]} |∫ ℓ(g(x_t), y_t) dP_t − (1/N) Σ_{n=1}^N ℓ(g(x_{t_n}), y_{t_n})|   [by (10)]
  = sup_{f∈F, t∈[T_1,T_2]} |E_t f − E_N f|   [by (5) and (6)]
  ≤ sup_{f∈F} sup_{z∈Ω, t∈[T_1,T_2]} |E_t^{(z)} f − E_N f|   [by (9)]   (11)

and thus the supremum

sup_{f∈F} sup_{z∈Ω, t∈[T_1,T_2]} |E_t^{(z)} f − E_N f|   (12)

is the most important quantity considered in this paper. It is straightforward to see that an upper bound of the supremum (12) is an upper bound of the generalization bound sup_{g∈G} |E(ℓ ∘ g) − E_N(ℓ ∘ g)|.

Generally, to obtain generalization bounds of a learning process, it is necessary to obtain the related concentration (or deviation) inequalities. Although Joulin [23] presented deviation inequalities for curved continuous-time Markov chains, his results are valid for one sample drawn from a Markov chain and cannot be straightforwardly applied to a sequence of time-dependent samples drawn from a Markov chain. By a martingale method, we extend the results proposed in [23] and obtain a deviation inequality for a sequence of time-dependent samples drawn from a continuous-time Markov chain with a countable state space. We then use the resultant deviation inequality to develop a symmetrization inequality for the sequence. Next, we present generalization bounds, based on the covering number, for an ERM-based learning process for time-dependent samples drawn from a continuous-time Markov chain. Afterwards, based on the derived bounds, we study the asymptotic convergence and the rate of convergence of the learning process.

The rest of this paper is organized as follows. Section II briefs some preliminaries. In Section III, we present a deviation inequality and a symmetrization inequality for a sequence of time-dependent samples drawn from a continuous-time Markov chain. Section IV shows the generalization bounds of the learning process for a continuous-time Markov chain. In Section V, we analyze the asymptotic convergence and the rate of convergence of the learning process. The last section concludes this paper, and detailed proofs are given in the Appendix.

II. PRELIMINARIES

In this section, we present the notations and conditions necessary for the following discussion. We refer to [24]–[26] for details of Markov chains and of the curvature of stochastic processes.

A. Notations

The definition of a continuous-time Markov chain is as follows.

Definition 1: A stochastic process {z_t}_{t≥0} taking values from a finite or countable set Ω is said to be a continuous-time Markov chain if, for all t, s ≥ 0 and z, z′ ∈ Ω,

Pr{z_{s+t} = z′ | z_s = z, {z_u : 0 ≤ u ≤ s}} = Pr{z_{s+t} = z′ | z_s = z}   (13)

holds. In the above definition, (13) expresses the Markov property, which can be interpreted as "given the present, the past and future are independent."

Assume that {z_t}_{t≥0} is a regular nonexplosive¹ continuous-time Markov chain with a countable state space Ω ⊂ R^K. We embed a metric d into Ω. Since the main results of this paper are built under the condition Ω ⊂ R^K, we only consider the case where d is the Euclidean metric on R^K.

¹Here, the word "regular" is understood in the sense of [24], i.e., the generator matrix is totally stable and conservative [24, p. 2]. The word "nonexplosive" means that the number of transitions in any finite interval of time is finite.
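As a complement to Definition 1, the following is a minimal sketch of how a sample path of a continuous-time Markov chain on a finite state space can be simulated from its generator matrix Q: the chain holds in state z for an exponential time with rate −Q(z, z) and then jumps according to the normalized off-diagonal entries of row z. The function name and the toy generator are our own illustrative choices.

```python
import numpy as np

# A minimal continuous-time Markov chain simulator over a finite state space,
# given a generator matrix Q (rows sum to zero, off-diagonal entries >= 0).
# The chain and Q below are hypothetical toys, not objects from the paper.
def simulate_ctmc(Q, z0, horizon, rng):
    """Return jump times and visited states of the chain on [0, horizon]."""
    times, states = [0.0], [z0]
    t, z = 0.0, z0
    while True:
        rate = -Q[z, z]                      # total jump rate out of state z
        if rate <= 0:                        # absorbing state: stay forever
            break
        t += rng.exponential(1.0 / rate)     # exponential holding time
        if t > horizon:
            break
        probs = Q[z].copy()
        probs[z] = 0.0
        probs /= rate                        # jump distribution over z' != z
        z = rng.choice(len(probs), p=probs)
        times.append(t)
        states.append(z)
    return np.array(times), np.array(states)

# Toy 3-state generator.
Q = np.array([[-1.0, 0.7, 0.3],
              [0.4, -0.9, 0.5],
              [0.2, 0.8, -1.0]])
rng = np.random.default_rng(1)
times, states = simulate_ctmc(Q, z0=0, horizon=10.0, rng=rng)
print(list(zip(np.round(times, 2), states)))
```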


Given a bounded λ-Lipschitz function f, we define the Lipschitz seminorm as

‖f‖_{Lip_d} := sup_{z_1≠z_2} |f(z_1) − f(z_2)| / d(z_1, z_2),  z_1, z_2 ∈ Ω.   (14)

By using the Lipschitz seminorm, we introduce the definition of the Wasserstein curvature for a Markov chain.

Definition 2: Let F be a function class composed of bounded λ-Lipschitz functions. The d-Wasserstein curvature with respect to z ∈ Ω at time t > 0 of the continuous-time Markov chain {z_t}_{t≥0} is defined by

H_t(z) := −(1/t) sup{ ln( ‖E_t^{(z)} f‖_{Lip_d} / ‖f‖_{Lip_d} ) : f ∈ F, f ≠ const }.   (15)

We refer to [23] for details of (14) and Definition 2. Next, we denote by (Q(z_1, z_2))_{z_1,z_2∈Ω} the generator matrix of the Markov chain {z_t}_{t≥0}. We also denote, for some t > 0 and H ∈ R,

C_t^{(H)} := sup_{0≤s≤t} e^{−H(t−s)}   (16)

and

M_t^{(H)} := (1 − e^{−2Ht}) / (2H)   (17)

where M_t^{(H)} = t if H = 0. Moreover, given a sample set Z_1^N = {z_{t_n}}_{n=1}^N, we define for any t > 0

Λ_t := Σ_{n=1}^N ln E e^{‖z_{t_n} − z_t‖}   (18)

and then

Λ* := max_{T_1≤t≤T_2} Λ_t   (19)

where ‖·‖ stands for the Euclidean norm.

B. Conditions

The continuous-time Markov chains considered in this paper should satisfy the following mild conditions.

C1) There exists a constant C_1 > 0 such that

sup_{z_1∈Ω} { Σ_{z_2∈Ω} d(z_1, z_2)² |Q(z_1, z_2)| } ≤ C_1.   (20)

C2) There exists a constant C_2 > 0 such that

sup_{t>0} { d(z_{t−}, z_{t+}) } ≤ C_2.   (21)

C3) There exists a constant H > 0 such that the d-Wasserstein curvature H_t(z) is lower bounded by H for any t > 0 and any z ∈ Ω:

inf_{t>0} H_t(z) ≥ H.   (22)

Condition C1) implies that the Markov chain {z_t}_{t≥0} has a bounded angle bracket for any t ≥ 0. Condition C2) implies that the jump of {z_t}_{t≥0} is bounded. Condition C3) was mentioned in [23, Definition 2.1], and a related discussion is given in [23, Remark 2.2]. Here, we present an intuitive explanation of this condition and point out what kinds of continuous-time Markov chains satisfy it. According to Definition 2, an alternative form of Condition C3) is that, for any t > 0,

‖E_t^{(z)} f‖_{Lip_d} ≤ e^{−Ht} ‖f‖_{Lip_d}.   (23)

The combination of (14) and (23) leads to

sup_{z_1≠z_2} |E_t^{(z_1)} f − E_t^{(z_2)} f| / d(z_1, z_2) ≤ e^{−Ht} sup_{z_1≠z_2} |f(z_1) − f(z_2)| / d(z_1, z_2).   (24)

By ignoring the supremum "sup_{z_1≠z_2}", if f is a nonzero bounded function with sup_{z∈R^K} |f(z)| ≤ B, we have for any z_1, z_2 ∈ R^K and t > 0

|E_t^{(z_1)} f − E_t^{(z_2)} f| ≤ 2Be^{−Ht}   (25)

which implies that the discrepancy |E_t^{(z_1)} f − E_t^{(z_2)} f| between the expectations of f with respect to different initial states z_1 and z_2 decays exponentially to zero as the time t goes to infinity. Therefore, according to (7), if a continuous-time Markov chain {z_t}_{t≥0} satisfies Condition C3), then intuitively its transition probabilities at time t for any two different initial states should be equivalent (or at least approximately equivalent) when t is large.
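The exponential decay in (25) can be checked numerically for a small chain. Writing R_t for the matrix of transition probabilities R_t(z, z′) in (7), one has R_t = expm(tQ) for a finite-state chain, so E_t^{(z)} f is just a matrix-vector product. The generator and the function f below are hypothetical; for an ergodic toy chain, the gap between conditional expectations started from different states shrinks exponentially in t, as the sketch shows.

```python
import numpy as np
from scipy.linalg import expm

# Numerical illustration of the contraction (25): for an ergodic toy chain,
# the conditional expectations E_t^{(z)} f from two different initial states
# approach each other exponentially fast. R_t = expm(t Q) plays the role of
# the transition probabilities R_t(z, z') in (7). Q and f are hypothetical.
Q = np.array([[-1.0, 0.7, 0.3],
              [0.4, -0.9, 0.5],
              [0.2, 0.8, -1.0]])
f = np.array([1.0, -2.0, 0.5])          # a bounded function on the state space

for t in [0.5, 1.0, 2.0, 4.0, 8.0]:
    R_t = expm(t * Q)                   # row z gives R_t(z, .)
    E_tz = R_t @ f                      # E_t^{(z)} f for every initial state z
    gap = E_tz.max() - E_tz.min()       # max discrepancy over initial states
    print(f"t = {t:4.1f}   max |E_t^(z) f - E_t^(z') f| = {gap:.2e}")
```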

III. DEVIATION INEQUALITIES AND SYMMETRIZATION INEQUALITIES

In this section, we utilize a martingale method to develop a deviation inequality for a sequence of time-dependent samples drawn from a continuous-time Markov chain, and then obtain a symmetrization inequality for such a sequence.

A. Deviation Inequalities

Deviation (or concentration) inequalities play an essential role in obtaining the generalization bounds for a certain learning process. Generally, specific deviation inequalities have to be developed for different learning processes. There are many popular deviation and concentration inequalities, e.g., Hoeffding's inequality, McDiarmid's inequality, Bennett's inequality, Bernstein's inequality, and Talagrand's inequality, which are valid under the assumption of i.i.d. samples. Moreover, Joulin [23] proposed deviation inequalities for one single sample drawn from a continuous-time Markov chain. However, his results are unsuitable for a sequence of time-dependent samples drawn from a continuous-time Markov chain. Here, we extend the deviation results in [23] and present a deviation inequality for a sequence of time-dependent samples drawn from a continuous-time Markov chain.

Theorem 1: Assume that f is a bounded λ-Lipschitz function and {z_t}_{t≥0} is a continuous-time Markov chain with a countable state space Ω ⊂ R^K satisfying Conditions C1)–C3). Let Z_1^N = {z_{t_n}}_{n=1}^N be a sample set drawn from {z_t}_{t≥0} in the time interval [T_1, T_2]. Define a function F: R^{KN} → R,

F(Z_1^N) := Σ_{n=1}^N f(z_{t_n}).   (26)

Then, for any t ∈ [T_1, T_2], we have for any ξ > λΛ_t

Pr{ sup_{z∈Ω} |F(Z_1^N) − E_t^{(z)} F| > ξ } ≤ exp( (N M_t^{(H)} C_1 / (C_2 C_t^{(H)})²) Γ( C_2 C_t^{(H)}(ξ − λΛ_t) / (N M_t^{(H)} C_1 λ) ) )   (27)

where

Γ(x) := x − (x + 1) ln(x + 1).   (28)
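To illustrate the shape of Theorem 1, the following sketch evaluates the right-hand side of (27) for hypothetical values of the constants C_1, C_2, H, λ, and Λ_t, with C_t^{(H)} set to its upper bound 1 purely for convenience. The point is qualitative: the bound is nontrivial, i.e., smaller than one, only once ξ exceeds λΛ_t by a sufficient margin, which mirrors the restriction ξ > λΛ_t in the theorem.

```python
import numpy as np

# A sketch that evaluates the right-hand side of the deviation inequality
# (27) for hypothetical constants; it only illustrates the shape of the
# bound, which becomes nontrivial (< 1) once xi exceeds lam * Lambda_t
# by a sufficient margin.
def Gamma(x):
    return x - (x + 1.0) * np.log(x + 1.0)          # Gamma(x) from (28)

def deviation_bound(xi, N, lam, Lambda_t, C1, C2, H, t):
    C_t = 1.0                                       # illustrative choice
    M_t = (1.0 - np.exp(-2.0 * H * t)) / (2.0 * H)  # M_t^(H) from (17)
    A = N * M_t * C1 / (C2 * C_t) ** 2
    B = C2 * C_t * (xi - lam * Lambda_t) / (N * M_t * C1 * lam)
    return np.exp(A * Gamma(B))

N, lam, Lambda_t, C1, C2, H, t = 500, 1.0, 2.0, 1.0, 1.0, 0.5, 1.0
for xi in [2.5, 5.0, 10.0, 20.0, 40.0, 100.0]:      # note xi > lam * Lambda_t
    b = deviation_bound(xi, N, lam, Lambda_t, C1, C2, H, t)
    print(f"xi = {xi:6.1f}   bound = {b:.3e}")
```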

Theorem 1 shows that, given ξ > λΛ_t, the probability of the event

sup_{z∈Ω} |F(Z_1^N) − E_t^{(z)} F| > ξ

can be bounded by the right-hand side of (27). Compared to the deviation results in the i.i.d. scenario (e.g., [2, Th. 1]), the derived deviation inequality (27) for continuous-time Markov chains is valid not for any ξ > 0 but only for any ξ > λΛ_t. Next, we present a symmetrization inequality for a sequence of time-dependent samples drawn from a continuous-time Markov chain.

B. Symmetrization Inequalities

Symmetrization inequalities are mainly used to replace the expected risk by an empirical risk computed on another sample set that is independent of the given sample set but has the same distribution. In this manner, generalization bounds can be achieved on the basis of some kinds of complexity measures, e.g., the covering number and the VC dimension. However, the classical symmetrization results are valid only for i.i.d. samples (see [2]). Here, we propose a symmetrization inequality for a sequence of time-dependent samples drawn from a continuous-time Markov chain.

For clarity of presentation, we give some notation for the following discussion. Given a sample set Z_1^N = {z_{t_n}}_{n=1}^N, we denote by Z′_1^N := {z′_{t_n}}_{n=1}^N the sample set such that z′_{t_n} has the same distribution as z_{t_n} for any 1 ≤ n ≤ N, and we write E′_N f for the empirical risk of f with respect to Z′_1^N. The following theorem presents a symmetrization inequality for a sequence of time-dependent samples drawn from a continuous-time Markov chain.

Theorem 2: Assume that F is a function class with the range [a, b] and {z_t}_{t≥0} is a continuous-time Markov chain with a countable state space Ω ⊂ R^K satisfying Conditions C1)–C3). Let Z_1^N and Z′_1^N be drawn from {z_t}_{t≥0} in the time interval [T_1, T_2]. Then, for any t ∈ [T_1, T_2], any ξ > 0, and any N ≥ 8(b − a)²/ξ²,

Pr{ sup_{f∈F} sup_{z∈Ω} |E_t^{(z)} f − E_N f| > ξ } ≤ 2 Pr{ sup_{f∈F} |E′_N f − E_N f| > ξ/2 }.   (29)

This theorem shows that, given ξ > 0, the probability of the event sup_{f∈F} sup_{z∈Ω} |E_t^{(z)} f − E_N f| > ξ can be bounded by using the probability of the event sup_{f∈F} |E′_N f − E_N f| > ξ/2, which is determined by the given sample sets Z_1^N and Z′_1^N, when N ≥ 8(b − a)²/ξ². Interestingly, the derived symmetrization inequality (29) not only has the same form but also is valid under the same condition as the classical symmetrization result in the i.i.d. scenario (see [2, Lemma 2]). In fact, as shown in the proof of Theorem 2, the inequality is achieved by using Chebyshev's inequality, which is valid not only in the i.i.d. scenario but also in the non-i.i.d. scenario. By using the derived deviation inequality and symmetrization inequality, we can obtain generalization bounds of the learning process for a continuous-time Markov chain.

IV. GENERALIZATION BOUNDS FOR CONTINUOUS-TIME MARKOV CHAINS

In this section, we present generalization bounds of the ERM-based learning process for time-dependent samples drawn from a continuous-time Markov chain. Since the resultant bounds are based on the covering number, we first introduce the definition of the covering number of F; details are given in [27] and [28].

Definition 3: Let Z_1^N = {z_n}_{n=1}^N be a sample set drawn from a distribution Z. For any 1 ≤ p ≤ ∞ and ξ > 0, the covering number of F at radius ξ with respect to L_p(Z_1^N), denoted by N(F, ξ, L_p(Z_1^N)), is the minimum size of a cover of radius ξ.

Subsequently, we come up with the main results of this paper.

Theorem 3: Assume that F is a function class composed of λ-Lipschitz functions with the range [a, b], and {z_t}_{t≥0} is a continuous-time Markov chain with a countable state space Ω ⊂ R^K satisfying Conditions C1)–C3). Let Z_1^N and Z′_1^N be drawn from {z_t}_{t≥0} in the time interval [T_1, T_2], and denote Z_1^{2N} := {Z_1^N, Z′_1^N}. Then, for any t ∈ [T_1, T_2], any ξ > 8λΛ_t/N, and any N ≥ 8(b − a)²/ξ²,

Pr{ sup_{f∈F} sup_{z∈Ω} |E_t^{(z)} f − E_N f| > ξ } ≤ 2 E[ N(F, ξ/8, L_1(Z_1^{2N})) ] exp( (N M_t^{(H)} C_1 / (C_2 C_t^{(H)})²) Γ( C_2 C_t^{(H)}(Nξ − 8λΛ_t) / (8N M_t^{(H)} C_1 λ) ) )   (30)

where Γ(x) is defined in (28).

This theorem states that, given ξ > 8λΛ_t/N, the probability of the event sup_{f∈F} sup_{z∈Ω} |E_t^{(z)} f − E_N f| > ξ is bounded by the right-hand side of (30). Note that this result depends on the choice of t ∈ [T_1, T_2]. Next, by combining (11), (19), and Theorem 3, we present a probabilistic inequality for the supremum (12) that is independent of the choice of t, and we then obtain the generalization bound sup_{f∈F} |E f − E_N f| of the learning process for a continuous-time Markov chain.


Theorem 4: Under the notations of Theorem 3, if Conditions C1)–C3) are valid, then for any ξ > 8λΛ*/N and any N ≥ 8(b − a)²/ξ², we have

Pr{ sup_{f∈F} |E f − E_N f| > ξ }
  ≤ Pr{ sup_{f∈F} sup_{z∈Ω, t∈[T_1,T_2]} |E_t^{(z)} f − E_N f| > ξ }
  ≤ 2 E[ N(F, ξ/8, L_1(Z_1^{2N})) ] exp( (N C_1(1 − e^{−2HT_1}) / (2H(C_2)² e^{−2HT_2})) Γ( H C_2 e^{−HT_2}(Nξ − 8λΛ*) / (4N C_1 λ(1 − e^{−2HT_2})) ) )   (31)

where Γ is defined in (28).

Proof: According to (16) and (17), we have for any t ∈ [T_1, T_2]

e^{−HT_2} ≤ C_t^{(H)} ≤ 1   (32)

and

(1 − e^{−2HT_1})/(2H) ≤ M_t^{(H)} ≤ (1 − e^{−2HT_2})/(2H).   (33)

Note that Γ(x) is negative for x > 0 and monotonically decreasing when x ≥ 0. Therefore, by combining (18), (19), (32), and (33), we have for any t ∈ [T_1, T_2], any ξ > 8λΛ*/N, and any N ≥ 8(b − a)²/ξ²

(N M_t^{(H)} C_1 / (C_2 C_t^{(H)})²) Γ( C_2 C_t^{(H)}(Nξ − 8λΛ_t) / (8N M_t^{(H)} C_1 λ) )
  ≤ (N C_1(1 − e^{−2HT_1}) / (2H(C_2)² e^{−2HT_2})) Γ( 2H C_2 e^{−HT_2}(Nξ − 8λΛ*) / (8N C_1 λ(1 − e^{−2HT_2})) ) < 0.   (34)

The combination of (30) and (34) directly leads to the result (31). This completes the proof.

This theorem shows that, given ξ > 8λΛ*/N, the probability of the event sup_{f∈F} |E f − E_N f| > ξ is bounded by the right-hand side of the last inequality in (31). As defined in (3), we have thus obtained the generalization bound of the ERM-based learning process for time-dependent samples drawn from a continuous-time Markov chain satisfying Conditions C1)–C3). Compared to the results on generalization bounds in the i.i.d. scenario (e.g., [27, Th. 2.3]), the derived generalization bound (31) for continuous-time Markov chains is valid not for any ξ > 0 but only for any ξ > 8λΛ*/N, i.e., there is a discrepancy between the empirical risk E_N f and the expected risk E f in the learning process of continuous-time Markov chains. However, the discrepancy is controllable and can be prejudged. In fact, since the Markov chain considered in this paper is nonexplosive and has a finite jump [see Condition C2)], there must exist a nonnegative constant C_3 such that Λ*/N ≤ C_3 for any N > 0.
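The following sketch indicates how the right-hand side of (31) behaves as N grows, under the additional assumption (ours, not the paper's) that the covering number grows like (16/ξ)^d for some hypothetical dimension d. For fixed ξ, the argument of Γ tends to a positive constant while the prefactor of Γ grows linearly in N, so the exponential factor eventually dominates the covering number and the bound decays to zero.

```python
import numpy as np

# A sketch of how the right-hand side of (31) behaves as N grows, assuming
# a polynomially growing covering number N(F, xi/8, L1) ~ (16/xi)^d with a
# hypothetical "dimension" d. All constants are illustrative assumptions.
def Gamma(x):
    return x - (x + 1.0) * np.log(x + 1.0)

def gen_bound(N, xi, lam, Lambda_star, C1, C2, H, T1, T2, d=5):
    cover = (16.0 / xi) ** d                        # assumed covering number
    A = N * C1 * (1.0 - np.exp(-2 * H * T1)) \
        / (2 * H * C2**2 * np.exp(-2 * H * T2))
    B = H * C2 * np.exp(-H * T2) * (N * xi - 8 * lam * Lambda_star) \
        / (4 * N * C1 * lam * (1.0 - np.exp(-2 * H * T2)))
    return 2.0 * cover * np.exp(A * Gamma(B))

for N in [10**2, 10**3, 10**4, 10**5]:              # needs xi > 8*lam*L*/N
    b = gen_bound(N, xi=0.5, lam=1.0, Lambda_star=2.0,
                  C1=1.0, C2=1.0, H=0.5, T1=1.0, T2=2.0)
    print(f"N = {N:6d}   bound on Pr{{sup |E f - E_N f| > 0.5}} = {b:.3e}")
```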


Fig. 1. Function curve of Γ(x).

The derived generalization bound (31) is independent of the choice of the time t and explicitly reflects the asymptotic behavior of the learning process as N goes to infinity. In the next section, based on Theorem 4, we analyze the asymptotic convergence and the rate of convergence of the learning process for continuous-time Markov chains.

V. ASYMPTOTIC CONVERGENCE

In this section, we study the asymptotic convergence and the rate of convergence of the ERM-based learning process for time-dependent samples drawn from a continuous-time Markov chain. We compare the obtained results with those of the learning process for i.i.d. samples and the learning process for β-mixing processes.

A. Asymptotic Convergence

Recalling (28), it is noteworthy that x = 0 is the only solution to the equation Γ(x) = 0 and that Γ(x) is monotonically decreasing when x ≥ 0 (shown in Fig. 1). Thus, according to Theorem 4, we obtain the following result, which describes the asymptotic convergence of the learning process for a continuous-time Markov chain.

Theorem 5: Under the notations and conditions of Theorem 4, if the condition

lim_{N→+∞} ln E[ N(F, ξ/8, L_1(Z_1^{2N})) ] / N < +∞   (35)

holds, then we have for any ξ > 8λΛ*/N

lim_{N→+∞} Pr{ sup_{f∈F} |E f − E_N f| > ξ } = 0.   (36)

As shown in Theorem 5, if the covering number N(F, ξ/8, L_1(Z_1^{2N})) satisfies (35), the probability of the event sup_{f∈F} |E f − E_N f| > ξ converges to zero for any ξ > 8λΛ*/N as the sample number N goes to infinity. This is partially in accordance with the classical result given by [27, Th. 2.3] for the asymptotic convergence of the learning process for i.i.d. samples: the probability of the event sup_{f∈F} |E f − E_N f| > ξ converges to zero for any ξ > 0 if the covering number N(F, ξ, L_1(Z_1^N)) satisfies

lim_{N→+∞} ln E[ N(F, ξ, L_1(Z_1^N)) ] / N < +∞.   (37)

This implies that, in the learning process of continuous-time Markov chains, the uniform convergence of the empirical risk E_N f to the expected risk E f may not hold, because the limit (36) does not hold for any ξ > 0 but only for any ξ > 8λΛ*/N. In contrast, the limit (36) holds for all ξ > 0 in the learning process of i.i.d. samples if (37) is satisfied (see [27, Th. 2.3]). Note that, as presented at the end of Section IV, the discrepancy 8λΛ*/N between E_N f and E f is controllable and can be prejudged. We show next that, by ignoring the discrepancy 8λΛ*/N, the learning process of continuous-time Markov chains has a faster rate of convergence than those of the learning process of i.i.d. samples and the learning process of β-mixing processes.

B. Rate of Convergence

Subsequently, we present an upper bound of the generalization bound sup_{f∈F} |E f − E_N f| and analyze the rate of convergence of the learning process for a continuous-time Markov chain. Given a number x̃ > 1, consider the following equation with respect to γ̃ > 0:

x̃ − (x̃ + 1) ln(x̃ + 1) = −x̃^{γ̃}   (38)

and denote its solution as

γ(x̃) := ln( (x̃ + 1) ln(x̃ + 1) − x̃ ) / ln(x̃).   (39)

Then we have, for any 0 < γ̃ ≤ γ(x̃),

x̃ − (x̃ + 1) ln(x̃ + 1) ≤ −x̃^{γ̃}.   (40)

By combining Theorem 4 and (38)–(40), we can straightforwardly obtain an upper bound of the generalization bound sup_{f∈F} |E_N f − E f| as follows.

Theorem 6: We follow the notations and conditions of Theorem 4. Then, for any ξ > 8λΛ*/N and any N ≥ 8(b − a)²/ξ², we have with probability at least 1 − ε

sup_{f∈F} |E_N f − E f| ≤ 8λΛ*/N + (4C_1λ(1 − e^{−2HT_2}) / (H C_2 e^{−HT_2})) ( 2H(C_2)² e^{−2HT_2} ( ln E[ N(F, ξ/8, L_1(Z_1^{2N})) ] − ln(ε/2) ) / (N C_1(1 − e^{−2HT_1})) )^{1/γ}

where 0 < γ ≤ γ( H C_2 e^{−HT_2}(Nξ − 8λΛ*) / (4N C_1 λ(1 − e^{−2HT_2})) ).

The above theorem shows that, with probability at least 1 − ε,

sup_{f∈F} |E_N f − E f| ≤ 8λΛ*/N + O( ( ( ln E[ N(F, ξ/8, L_1(Z_1^{2N})) ] − ln(ε/2) ) / N )^{1/γ} ).   (41)

Thus, to find the rate of convergence of the learning process for continuous-time Markov chains, we have to study the upper bound of the function γ(x) (x > 1). According to (39), for any x > 1, we consider the derivative of γ(x)

γ′(x) = ln(x + 1) / ( ln(x) ( (x + 1) ln(x + 1) − x ) ) − ln( (x + 1) ln(x + 1) − x ) / ( x(ln x)² )   (42)

and draw the function curve of γ′(x) in Fig. 2.

Fig. 2. Function curve of γ′(x).

Fig. 2 shows that there is only one solution to the equation γ′(x) = 0 (x > 1). Letting the solution be x̂, we then have γ′(x) > 0 (1 < x < x̂) and γ′(x) < 0 (x > x̂), i.e., γ(x) is monotonically decreasing when x > x̂. Meanwhile, by (42), we have

lim_{x→+∞} γ′(x) = 0.   (43)

Furthermore, we study the second derivative of γ(x)

γ′′(x) = ln( (x + 1) ln(x + 1) − x ) / ( x²(ln x)² ) + 2 ln( (x + 1) ln(x + 1) − x ) / ( x²(ln x)³ )
       − 1 / ( (x + 1)( x − (x + 1) ln(x + 1) ) ln x )
       − (ln(x + 1))² / ( ( x − (x + 1) ln(x + 1) )² ln x )
       + 2 ln(x + 1) / ( x(ln x)² ( x − (x + 1) ln(x + 1) ) )   (44)

and draw the function curve of γ′′(x) in Fig. 3.

Fig. 3. Function curve of γ′′(x).

Fig. 3 shows that there is a solution to the equation γ′′(x) = 0 and its value is approximately equal to 137.67. Moreover, according to (44), we arrive at

lim_{x→+∞} γ′′(x) = 0.   (45)

Therefore, by combining (42)–(45), we obtain that γ(x) has only one global maximum point when x > 1, and we denote

x̂ := arg max_{x>1} γ(x).   (46)

Our further numerical experiment shows that the value of x̂ is approximately equal to 69.85 and that the maximum of γ(x) (x > 1) is not larger than 1.3 (Fig. 4).

Fig. 4. Function curve of γ(x).

Thus, according to (41), we obtain that, with probability at least 1 − ε,

sup_{f∈F} |E_N f − E f| < 8λΛ*/N + O( ( ln E[ N(F, ξ/8, L_1(Z_1^{2N})) ] / N )^{1/1.3} ).   (47)

Compared to the result for the ERM-based learning process of i.i.d. samples (see [27, Th. 2.3]),

sup_{f∈F} |E_N f − E f| ≤ O( ( ln E[ N(F, ξ, L_1(Z_1^N)) ] / N )^{1/2} )

by ignoring the discrepancy 8λΛ*/N, the learning process for continuous-time Markov chains provides a faster rate of convergence than that of the learning process of i.i.d. samples. Finally, we compare the derived generalization bound for continuous-time Markov chains with the Rademacher-complexity-based generalization bound for β-mixing processes (see [10]). By combining [10, Th. 2] and [27, Th. 2.31], we have with probability at least 1 − ε

sup_{f∈F} |E_N f − E f| ≤ O( (1/√N) ∫_0^{+∞} √( ln N(F, ξ, L_2(Z_1^N)) ) dξ ) + O(1/√N).   (48)

By comparing (47) with (48), we find that the ERM-based learning process of continuous-time Markov chains can also provide a faster rate of convergence than the learning process of β-mixing processes.

VI. CONCLUSION

In this paper, we mainly studied the generalization bounds of the ERM-based learning process for time-dependent samples drawn from continuous-time Markov chains. By using a martingale method, we provided a deviation inequality for a sequence of time-dependent samples drawn from a continuous-time Markov chain and then obtained a symmetrization inequality for this sequence. Next, we utilized the derived deviation inequality and symmetrization inequality to achieve the generalization bound based on the covering number. Finally, we analyzed the asymptotic convergence and the rate of convergence of the learning process for a continuous-time Markov chain. We found that the asymptotic convergence of the learning process is determined by the complexity of the function class F measured by the covering number. This is partially in accordance with the classical results on the asymptotic convergence of the learning process for i.i.d. samples (see [27]). Although the uniform convergence of the empirical risk to the expected risk may not be supported in this learning process, the discrepancy between the empirical risk and the expected risk is controllable and can be prejudged. Moreover, the rate of convergence of this learning process is faster than


those of the learning process for i.i.d. samples and the learning process for β-mixing processes. It is noteworthy that, by using [27, Th. 2.18], the bounds (30) and (31) can lead to results based on the fat-shattering dimension. According to [4, Th. 2.6.4], bounds based on the VC dimension can also be derived from (30) and (31). In future work, we will develop generalization bounds of the learning process for a continuous-time Markov chain by using other complexity measures, e.g., the Rademacher complexity.

APPENDIX
PROOFS OF MAIN RESULTS

In the Appendix, we prove Theorems 1-3.

A. Proof of Theorem 1

We prove Theorem 1 by a martingale method. Before the formal proof, we introduce some essential notation. Define the random variables

S_n := E_t^{(z)}[ F(Z_1^N) | Z_1^n ],  0 ≤ n ≤ N   (49)

where Z_1^n = {z_{t_1}, ..., z_{t_n}} ⊆ Z_1^N and Z_1^0 = ∅. It is straightforward to see that S_0 = E_t^{(z)} F(Z_1^N) and S_N = F(Z_1^N). Then, according to (26) and (49), since f is λ-Lipschitz continuous, we have for any 1 ≤ n ≤ N

S_n − S_{n−1} = E_t^{(z)}[ F(Z_1^N) | Z_1^n ] − E_t^{(z)}[ F(Z_1^N) | Z_1^{n−1} ]
  = E_t^{(z)}[ Σ_{m=1}^N f(z_{t_m}) | Z_1^n ] − E_t^{(z)}[ Σ_{m=1}^N f(z_{t_m}) | Z_1^{n−1} ]
  = ( Σ_{m=1}^n f(z_{t_m}) + E_t^{(z)}[ Σ_{m=n+1}^N f(z_{t_m}) ] ) − ( Σ_{m=1}^{n−1} f(z_{t_m}) + E_t^{(z)}[ Σ_{m=n}^N f(z_{t_m}) ] )
  = f(z_{t_n}) − E_t^{(z)} f
  = f(z_{t_n}) − f(z_t) + f(z_t) − E_t^{(z)} f
  ≤ λ‖z_{t_n} − z_t‖ + f(z_t) − E_t^{(z)} f   (50)

where ‖·‖ stands for the Euclidean norm. To prove Theorem 1, we need the following inequality [23].

Lemma 1: Let f: R^K → R be a λ-Lipschitz function. Under the notations of Theorem 1, for any α > 0 and any z ∈ Ω,

E[ e^{α(f(z_t) − E_t^{(z)} f)} ] ≤ exp( (M_t^{(H)} C_1 / (C_2 C_t^{(H)})²)( e^{αλC_2 C_t^{(H)}} − αλC_2 C_t^{(H)} − 1 ) ).   (51)

Proof: The following formula appears in the proof of [23, Th. 3.1]:

E[ e^{α(f(z_t) − E_t^{(z)} f)} ] ≤ exp( (M_t^{(H)} C_1 / (C_2 C_t^{(H)})²)( e^{α‖f‖_{Lip_d} C_2 C_t^{(H)}} − α‖f‖_{Lip_d} C_2 C_t^{(H)} − 1 ) ).   (52)

Recalling (14) and letting d be the Euclidean metric on R^K, we have

‖f‖_{Lip_d} = sup_{z_1≠z_2} |f(z_1) − f(z_2)| / d(z_1, z_2) ≤ λ.   (53)

Since e^x − x − 1 (x > 0) is a monotonically increasing function, the combination of (52) and (53) leads to the result (51). This completes the proof.

Next, we prove Theorem 1.

Proof: According to (50), Markov's inequality, Jensen's inequality, and the law of iterated expectation, we have for any α > 0

Pr{ F(Z_1^N) − E_t^{(z)} F > ξ } ≤ e^{−αξ} E[ e^{α(F(Z_1^N) − E_t^{(z)} F)} ]
  = e^{−αξ} E[ e^{α Σ_{n=1}^N (S_n − S_{n−1})} ]
  = e^{−αξ} E[ E[ e^{α Σ_{n=1}^N (S_n − S_{n−1})} | Z_1^{N−1} ] ]
  = e^{−αξ} E[ e^{α Σ_{n=1}^{N−1} (S_n − S_{n−1})} E[ e^{α(S_N − S_{N−1})} | Z_1^{N−1} ] ]
  = e^{−αξ} Π_{n=1}^N E[ e^{α(S_n − S_{n−1})} | Z_1^{n−1} ]
  ≤ e^{−αξ} Π_{n=1}^N E[ e^{α( f(z_t) − E_t^{(z)} f + λ‖z_{t_n} − z_t‖ )} ]
  ≤ e^{−αξ} Π_{n=1}^N E[ e^{α( f(z_t) − E_t^{(z)} f )} ] E[ e^{αλ‖z_{t_n} − z_t‖} ]
  = e^{−αξ} ( E[ e^{α( f(z_t) − E_t^{(z)} f )} ] )^N exp( αλ Σ_{n=1}^N ln E e^{‖z_{t_n} − z_t‖} ).   (54)

By combining (18), (54), and Lemma 1, we have

Pr{ F(Z_1^N) − E_t^{(z)} F > ξ } ≤ e^{Φ_t(α) − αξ}   (55)

where

Φ_t(α) := (N M_t^{(H)} C_1 / (C_2 C_t^{(H)})²)( e^{αλC_2 C_t^{(H)}} − αλC_2 C_t^{(H)} − 1 ) + αλΛ_t.   (56)

Note that Φ_t is infinitely differentiable for α > 0 with

Φ′_t(α) = (N M_t^{(H)} C_1 λ / (C_2 C_t^{(H)}))( e^{αλC_2 C_t^{(H)}} − 1 ) + λΛ_t > 0   (57)

and

Φ′′_t(α) = N M_t^{(H)} C_1 λ² e^{αλC_2 C_t^{(H)}} > 0.   (58)

Denote ϕ_t(α) := Φ′_t(α), and then minimize the right-hand side of (55) with respect to α. According to (57) and (58), for any ξ > 0, min_{α>0}{Φ_t(α) − αξ} is achieved when ϕ_t(α) − ξ = 0. By (18) and (57), we have ϕ_t(0) = λΛ_t and ϕ_t^{−1}(λΛ_t) = 0. Since Φ_t(0) = 0, we arrive at

Φ_t(ϕ_t^{−1}(ξ)) = ∫_0^{ϕ_t^{−1}(ξ)} ϕ_t(s) ds = ∫_{λΛ_t}^{ξ} s dϕ_t^{−1}(s)
  = ξϕ_t^{−1}(ξ) − (λΛ_t)ϕ_t^{−1}(λΛ_t) − ∫_{λΛ_t}^{ξ} ϕ_t^{−1}(s) ds
  = ξϕ_t^{−1}(ξ) − ∫_{λΛ_t}^{ξ} ϕ_t^{−1}(s) ds.

Thus, for any ξ > λΛ_t,

min_{α>0}{ Φ_t(α) − αξ } = −∫_{λΛ_t}^{ξ} ϕ_t^{−1}(s) ds.

Similarly, we can prove that for any z ∈ Ω

Pr{ E_t^{(z)} F − F(Z_1^N) > ξ } ≤ exp( −∫_{λΛ_t}^{ξ} ϕ_t^{−1}(s) ds ).

Therefore, we have for any ξ > λΛ_t

Pr{ sup_{z∈Ω} |F(Z_1^N) − E_t^{(z)} F| > ξ } ≤ exp( −∫_{λΛ_t}^{ξ} ϕ_t^{−1}(s) ds )
  = exp( −∫_{λΛ_t}^{ξ} (1/(λC_2 C_t^{(H)})) ln( 1 + C_2 C_t^{(H)}(s − λΛ_t)/(N M_t^{(H)} C_1 λ) ) ds )
  = exp( (N M_t^{(H)} C_1 / (C_2 C_t^{(H)})²) Γ( C_2 C_t^{(H)}(ξ − λΛ_t) / (N M_t^{(H)} C_1 λ) ) )   (59)

where Γ(x) = x − (x + 1) ln(x + 1). This completes the proof.
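As a sanity check on the optimization step above (not part of the original proof), the following sketch minimizes Φ_t(α) − αξ numerically for arbitrary placeholder values of the constants and compares the minimum with the closed-form exponent in (59); the two agree to numerical precision.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Numerical check of the Chernoff optimization: the infimum over alpha > 0
# of Phi_t(alpha) - alpha*xi, with Phi_t from (56), matches the closed-form
# exponent A * Gamma((xi - lam*Lambda_t)/(A*c)) used in (59).
# A, c, lamL, xi are arbitrary placeholders.
A = 50.0                 # N * M_t^(H) * C1 / (C2 * C_t^(H))^2
c = 0.8                  # lam * C2 * C_t^(H)
lamL = 2.0               # lam * Lambda_t
xi = 6.0                 # any xi > lam * Lambda_t

Phi = lambda a: A * (np.exp(c * a) - c * a - 1.0) + a * lamL
res = minimize_scalar(lambda a: Phi(a) - a * xi, bounds=(0.0, 50.0),
                      method="bounded")

Gamma = lambda x: x - (x + 1.0) * np.log(x + 1.0)
closed_form = A * Gamma((xi - lamL) / (A * c))      # exponent in (59)
print(f"numerical min  = {res.fun:.6f}")
print(f"closed form    = {closed_form:.6f}")
```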

B. Proof of Theorem 2

The following is the proof of Theorem 2.

Proof: Let A stand for an event and denote the indicator function of the event A by

1_A = 1 if A occurs, and 1_A = 0 otherwise.

For any z ∈ Ω, let f̂ be the function achieving the supremum sup_{f∈F} |E_t^{(z)} f − E_N f| with respect to Z_1^N. Denoting by ∧ the conjunction of two events, we have

1_{(E_t^{(z)} f̂ − E_N f̂) > ξ} · 1_{(E_t^{(z)} f̂ − E′_N f̂) < ξ/2}
  = 1_{ {(E_t^{(z)} f̂ − E_N f̂) > ξ} ∧ {(E′_N f̂ − E_t^{(z)} f̂) ≥ −ξ/2} }
  ≤ 1_{(E′_N f̂ − E_N f̂) > ξ/2}.

On the other hand, we have

1_{(E_N f̂ − E_t^{(z)} f̂) > ξ} · 1_{(E′_N f̂ − E_t^{(z)} f̂) < ξ/2}
  = 1_{ {(E_N f̂ − E_t^{(z)} f̂) > ξ} ∧ {(E_t^{(z)} f̂ − E′_N f̂) ≥ −ξ/2} }
  ≤ 1_{(E_N f̂ − E′_N f̂) > ξ/2}.

Then, taking the expectation with respect to Z′_1^N gives

Pr{ |E_t^{(z)} f̂ − E′_N f̂| < ξ/2 } · 1_{|E_t^{(z)} f̂ − E_N f̂| > ξ} ≤ Pr{ |E′_N f̂ − E_N f̂| > ξ/2 }.   (60)

By Chebyshev's inequality, we have for any ξ > 0

Pr{ |E_t^{(z)} f̂ − E′_N f̂| ≥ ξ/2 } ≤ 4 E[ Σ_{n=1}^N (E_t^{(z)} f̂ − f̂(z′_{t_n}))² ] / (N²ξ²) ≤ 4N(b − a)²/(N²ξ²) = 4(b − a)²/(Nξ²).   (61)

Consequently, according to (60) and (61), we obtain, for any ξ > 0,

Pr{ |E′_N f̂ − E_N f̂| > ξ/2 } ≥ ( 1 − 4(b − a)²/(Nξ²) ) · 1_{|E_t^{(z)} f̂ − E_N f̂| > ξ}.   (62)

Finally, letting 4(b − a)²/(Nξ²) ≤ 1/2, we obtain the result (29) by taking the expectation with respect to Z_1^N. This completes the proof.

C. Proof of Theorem 3

Finally, we prove Theorem 3.

Proof: Consider {σ_n}_{n=1}^N as independent Rademacher random variables, i.e., independent {−1, 1}-valued random variables with equal probability of taking either value. Given {σ_n}_{n=1}^N and Z_1^{2N}, for any f ∈ F, we denote

σ⃗ := (σ_1, ..., σ_N, −σ_1, ..., −σ_N)^T   (63)

and

f(Z_1^{2N}) := ( f(z′_{t_1}), ..., f(z′_{t_N}), f(z_{t_1}), ..., f(z_{t_N}) )^T.   (64)

According to (6) and Theorem 2, for any z ∈ Ω and any ξ > 8λΛ_t/N, we have

Pr{ sup_{f∈F} |E_t^{(z)} f − E_N f| > ξ }
  ≤ 2 Pr{ sup_{f∈F} |E′_N f − E_N f| > ξ/2 }
  = 2 Pr{ sup_{f∈F} |(1/N) Σ_{n=1}^N ( f(z′_{t_n}) − f(z_{t_n}) )| > ξ/2 }
  = 2 Pr{ sup_{f∈F} |(1/N) Σ_{n=1}^N σ_n ( f(z′_{t_n}) − f(z_{t_n}) )| > ξ/2 }
  = 2 Pr{ sup_{f∈F} (1/2N) |⟨σ⃗, f(Z_1^{2N})⟩| > ξ/4 }.   (65)

Fix a realization of Z_1^{2N} and let Δ be a ξ/8-radius cover of F with respect to the L_1(Z_1^{2N}) norm. Since F is composed of λ-Lipschitz functions, we assume that the same holds for any h ∈ Δ. According to the triangle inequality, if f_0 is the function that achieves sup_{f∈F} (1/2N)|⟨σ⃗, f(Z_1^{2N})⟩| > ξ/4, there must be an h_0 ∈ Δ that satisfies

(1/2N) Σ_{n=1}^N ( |f_0(z′_{t_n}) − h_0(z′_{t_n})| + |f_0(z_{t_n}) − h_0(z_{t_n})| ) < ξ/8

and meanwhile

(1/2N) |⟨σ⃗, h_0(Z_1^{2N})⟩| > ξ/8.

Therefore, for the realization of Z_1^{2N}, we arrive at

Pr{ sup_{f∈F} (1/2N)|⟨σ⃗, f(Z_1^{2N})⟩| > ξ/4 } ≤ Pr{ sup_{h∈Δ} (1/2N)|⟨σ⃗, h(Z_1^{2N})⟩| > ξ/8 }.   (66)

Moreover, we denote the event

A := { sup_{h∈Δ} (1/2N)|⟨σ⃗, h(Z_1^{2N})⟩| > ξ/8 }

and let 1_A be the indicator function of the event A. By Fubini's theorem, we have

Pr{A} = E[ E_σ[ 1_A | Z_1^{2N} ] ] = E[ Pr{ sup_{h∈Δ} (1/2N)|⟨σ⃗, h(Z_1^{2N})⟩| > ξ/8 | Z_1^{2N} } ].   (67)

Fix a realization of Z_1^{2N}. For any t ∈ [T_1, T_2] and any ξ > 8λΛ_t/N, according to (63), (64), and Theorem 1, we have for any z ∈ Ω

Pr{ sup_{h∈Δ} (1/2N)|⟨σ⃗, h(Z_1^{2N})⟩| > ξ/8 }
  ≤ |Δ| max_{h∈Δ} Pr{ (1/2N)|⟨σ⃗, h(Z_1^{2N})⟩| > ξ/8 }
  = N(F, ξ/8, L_1(Z_1^{2N})) max_{h∈Δ} Pr{ |E′_N h − E_N h| > ξ/4 }
  ≤ N(F, ξ/8, L_1(Z_1^{2N})) max_{h∈Δ} Pr{ |E_t^{(z)} h − E′_N h| + |E_t^{(z)} h − E_N h| > ξ/4 }
  = N(F, ξ/8, L_1(Z_1^{2N})) max_{h∈Δ} Pr{ |E_t^{(z)} h − E_N h| > ξ/8 }
  ≤ N(F, ξ/8, L_1(Z_1^{2N})) exp( (N M_t^{(H)} C_1 / (C_2 C_t^{(H)})²) Γ( C_2 C_t^{(H)}(Nξ − 8λΛ_t) / (8N M_t^{(H)} C_1 λ) ) ).   (68)

The combination of (65), (66), and (68) leads to the result (30). This completes the proof.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers and the editors for their valuable comments and suggestions.

REFERENCES

[1] V. N. Vapnik, "An overview of statistical learning theory," IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988-999, Sep. 1999.
[2] O. Bousquet, S. Boucheron, and G. Lugosi, "Introduction to statistical learning theory," in Advanced Lectures on Machine Learning, O. Bousquet, Ed. New York: Springer-Verlag, 2004, pp. 169-207.
[3] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[4] A. van der Vaart and J. Wellner, Weak Convergence and Empirical Processes with Applications to Statistics. New York: Springer-Verlag, 2000.
[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Learnability and the Vapnik-Chervonenkis dimension," J. ACM, vol. 36, no. 4, pp. 929-965, 1989.
[6] D. Zhou, "Capacity of reproducing kernel spaces in learning theory," IEEE Trans. Inf. Theory, vol. 49, no. 7, pp. 1743-1752, Jul. 2003.
[7] C. Zhang, W. Bian, D. Tao, and W. Lin, "Discretized-Vapnik-Chervonenkis dimension for analyzing complexity of real function classes," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 9, pp. 1461-1472, Sep. 2012.
[8] P. L. Bartlett, O. Bousquet, and S. Mendelson, "Local Rademacher complexities," Ann. Stat., vol. 33, no. 4, pp. 1497-1537, 2005.
[9] Z. Hussain and J. Shawe-Taylor, "Improved loss bounds for multiple kernel learning," J. Mach. Learn. Res., vol. 15, pp. 370-377, Dec. 2011.
[10] M. Mohri and A. Rostamizadeh, "Rademacher complexity bounds for non-i.i.d. processes," in Proc. Neural Inf. Process. Syst., 2008, pp. 1097-1104.
[11] M. Mohri and A. Rostamizadeh, "Stability bounds for stationary ϕ-mixing and β-mixing processes," J. Mach. Learn. Res., vol. 11, pp. 798-814, Feb. 2010.
[12] C. Zhang and D. Tao, "Risk bounds for Lévy processes in the PAC-learning framework," J. Mach. Learn. Res., vol. 9, pp. 948-955, Dec. 2010.
[13] B. Yu, "Rates of convergence for empirical processes of stationary mixing sequences," Ann. Probab., vol. 22, no. 1, pp. 94-116, 1994.
[14] C. Houdré and P. Marchal, "Median, concentration and fluctuations for Lévy processes," Stochastic Processes Appl., vol. 118, pp. 852-863, Aug. 2008.
[15] K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik, "Predicting time series with support vector machines," in Proc. 7th Int. Conf. Artif. Neural Netw., 1997, pp. 1-7.
[16] S. Mukherjee, E. Osuna, and F. Girosi, "Nonlinear prediction of chaotic time series using support vector machines," in Proc. IEEE Workshop Neural Netw. Signal Process., Sep. 1997, pp. 511-520.
[17] K.-J. Kim, "Financial time series forecasting using support vector machines," Neurocomputing, vol. 55, nos. 1-2, pp. 307-319, 2003.
[18] M. Biguesh and A. B. Gershman, "Training-based MIMO channel estimation: A study of estimator tradeoffs and optimal training signals," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 884-893, Mar. 2006.
[19] D. Love, R. Heath, V. Lau, D. Gesbert, B. Rao, and M. Andrews, "An overview of limited feedback in wireless communication systems," IEEE J. Sel. Areas Commun., vol. 26, no. 8, pp. 1341-1365, Oct. 2008.
[20] A. Tulino, A. Lozano, and S. Verdú, "Impact of antenna correlation on the capacity of multiantenna channels," IEEE Trans. Inf. Theory, vol. 51, no. 7, pp. 2491-2509, Jul. 2005.
[21] M. Sanchez-Fernandez, M. de-Prado-Cumplido, J. Arenas-Garcia, and F. Perez-Cruz, "SVM multiregression for nonlinear channel estimation in multiple-input multiple-output systems," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2298-2307, Aug. 2004.
[22] A. Sutivong, M. Chiang, T. M. Cover, and Y.-H. Kim, "Channel capacity and state estimation for state-dependent Gaussian channels," IEEE Trans. Inf. Theory, vol. 51, no. 4, pp. 1486-1495, Apr. 2005.
[23] A. Joulin, "Poisson-type deviation inequalities for curved continuous-time Markov chains," Bernoulli, vol. 13, no. 3, pp. 782-798, 2007.
[24] M. Chen, From Markov Chains to Non-Equilibrium Particle Systems, 2nd ed. Singapore: World Scientific, 2004.
[25] Y. Ollivier, "Ricci curvature of Markov chains on metric spaces," J. Funct. Anal., vol. 256, no. 3, pp. 810-864, 2009.
[26] Q. Zhu and J. Cao, "Stability analysis of Markovian jump stochastic BAM neural networks with impulse control and mixed time delays," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, pp. 467-479, Mar. 2012.
[27] S. Mendelson, "A few notes on statistical learning theory," in Advanced Lectures on Machine Learning (Lecture Notes in Computer Science, vol. 2600). New York: Springer-Verlag, 2003, pp. 1-40.
[28] D. Zhou, "The covering number in learning theory," J. Complexity, vol. 18, no. 3, pp. 739-767, 2002.


Chao Zhang received the Bachelor's and Ph.D. degrees from Dalian University of Technology, Dalian, China, in 2004 and 2009, respectively. He is currently a Post-Doctoral Researcher with the Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University, Tempe. He was a Research Fellow with the School of Computer Engineering, Nanyang Technological University, Singapore, from 2009 to 2011. His current research interests include neural networks, machine learning, and statistical learning theory.


Dacheng Tao (SM'12) is a Professor of computer science with the Centre for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. He specializes in applied statistics and mathematics for data analysis problems in data mining, computer vision, machine learning, multimedia, and video surveillance. He has authored or co-authored more than 100 scientific articles at top venues, including the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, AISTATS, ICDM, CVPR, ECCV, ACM Multimedia, and KDD. He was a recipient of the Best Theory/Algorithm Paper Runner Up Award at IEEE ICDM'07.
