IET Systems Biology Special Issue: Computational Models and Methods in Systems Biology and Medicine

Structural and practical identifiability analysis of S-system

ISSN 1751-8849 Received on 27th February 2015 Revised on 3rd August 2015 Accepted on 14th August 2015 doi: 10.1049/iet-syb.2015.0014 www.ietdl.org

Choujun Zhan 1, Benjamin Yee Shing Li 2 ✉, Lam Fat Yeung 2 1

Department of Electronics Communication and Software Engineering, Nanfang College of Sun Yat-Sen University, Guangdong 510970, People’s Republic of China 2 Department of Electronic Engineering, City University of Hong Kong, Hong Kong, Hong Kong ✉ E-mail: [email protected]

Abstract: In the field of systems biology, biological reaction networks are usually modelled by ordinary differential equations. A sub-class, the S-systems representation, is a widely used form of modelling. Existing S-systems identification techniques assume that the system itself is always structurally identifiable. However, due to practical limitations, biological reaction networks are often only partially measured. In addition, the captured data only covers a limited trajectory, therefore data can only be considered as a local snapshot of the system responses with respect to the complete set of state trajectories over the entire state space. Hence the estimated model can only reflect partial system dynamics and may not be unique. To improve the identification quality, the structural and practical identifiablility of S-system are studied. The S-system is shown to be identifiable under a set of assumptions. Then, an application on yeast fermentation pathway was conducted. Two case studies were chosen; where the first case is based on a larger state trajectories and the second case is based on a smaller one. By expanding the dataset which span a relatively larger state space, the uncertainty of the estimated system can be reduced. The results indicated that initial concentration is related to the practical identifiablity.

1

Introduction

Benefiting from the technological breakthrough of experimental biology, large volume and variety of biological data are able to be collected [1]. This data is important for the study of cellular processes and underlying molecular events, and is essential for quantitative study and simulation of biological systems [2]. A standard approach to investigate complex dynamics, non-linear interaction mechanisms in cellular processes is to characterise the biological system as mathematical models [3, 4]. One of the commonly used approaches is to model such dynamics via ordinary differential equations (ODEs) [3, 5, 6]. Although the basic mathematical structure is known, very often the experimental results are only partially available. This incompleteness of information introduce difficulties to system identification [2, 7]. Common approach is to formulate this type of problem as an optimisation problem to calibrate the unknown parameters in the sense that the output of the mathematical model is in good alignment with the measured data [8]. In this context, a number of parameter estimation methods are developed, for instance, decoupling [9–11], immune algorithm [12], genetic algorithm [13, 14] etc. [15–18]. While applying these parameter estimation methods, a strong assumption is being made: the parameters of the model can be uniquely determined under the ideal conditions of noise-free data and error-free model structure [19–22]. This is the a priori condition on structure identification. If the mathematical model itself is not identifiable, due to the absence of uniqueness of parameters, estimation performance can hardly be improved from the algorithmic prospective. Structural identifiability itself only does not guarantee an accurate identification of model from experimental data. Moreover, in practice, experimental data is usually sparse and corrupted by high-level of noise, combine with the high complexity of the model, some of the parameters may become unidentifiable [23]. Conventional study on structural identifiability focuses on the cases where infinity noise-free data is given, this ideal situation may be infeasible when investigating the identifiability in a

IET Syst. Biol., 2015, Vol. 9, Iss. 6, pp. 285–293 & The Institution of Engineering and Technology 2015

practical situation [24]. Subsequently, numerous methods have been developed for analysing the structural and also its practical identifiability of various system [19–21, 25–29], however the well-known S-system model has not yet been covered. In this paper, the structural and practical identifiability of S-system is analysed based on our previous work [30]. The analysis shows that the S-system is structurally identifiable but its practical identifiability is not guaranteed. The practical identifiability is found to be highly correlated to the initial concentrations of the nominal model. To illustrate this property, the yeast fermentation pathway model is employed as a benchmark model. In this study two cases are considered: (i) an ideal case where sufficiently large noise-free data is given and (ii) a pseudo-practical case with smaller dataset generated by limiting the initial concentration region of the nominal model. The parameter estimation problem is formulated as a least square problem and solved via a hybrid method based on spline smoothing technique and differential evolution (DE) algorithm [18]. According to the results, in case (i), the topology and parameters of yeast fermentation pathway can be identified simultaneously. However, in case (ii), the identified model may encounter an over-complexity problem, namely, the identified model contains too many erroneous interactions having large magnitude in corresponding parameters. Under this context, these erroneous interactions cannot be ignored. To overcome this problem, regularisation can be introduced by adding penalty terms into the objective function to control the complexity of the model. However, the estimated model may suffer from another issue, the under-complexity problem. That is some interactions which actually exist are being removed from the identified model due to low magnitude of its corresponding parameters. Therefore, the underlying S-system cannot be identified uniquely. This uncertainty can be narrowed by expanding the captured dataset which cover a relatively larger state space. The paper is organised as following: in Section 2, preliminary knowledge of S-system model, structural identifiability and practical identifablity are introduced. We also show that S-system is structurally identifiable in idea case; in Section 3, we utilise

285

yeast fermentation path way model as a benchmark model to analyse the structure and practical identifiability. In Section 3.1, idea case with sufficient data is simulated, while pseudo-practical case with insufficient data is provided in Section 3.2.

2

n g n h   xj i,j , Hi (x) W bi xj i,j , Fi (x) W Gi (x)−Hi (x), Define Gi (x) W ai j=1 j=1 n g˜ n h˜   xj i,j , H˜ i (x) W b˜ xj i,j and F˜i (x) W G˜ i (x) − H˜ i (x), (i = 1, G˜ i (x) W a˜ j=1

j=1

2, …, n). Lemma 1: If Fi (x) = F˜ i (x), then,

Identifiability problem of S-systems

m ˜ ˜ ˜ ˜m Gi (x) − hm gi,j i,j Hi (x) = g i,j Gi (x) − hi,j H i (x), m

2.1

Structural identifiability

In this section, the structural identifiability of S-systems is discussed, in addition a theoretical analysis is conducted. The concept of structural and practical identifiability has been proposed in [19–21]. A dynamic system can be described by a set of ODEs x˙ (t) = f (x(t), u(t), u), x(t0 ) = x0 , y(t) = g(x(t), u) + v(t),

j=1

g

xj i,j − bi

n 

h

xj i,j ,

xi (0) = xi,0 ,

(2)

j=1

where n is the number of reaction species, xi denotes the concentration of the ith metabolite (i ∈ {1, 2, …, n}), x0 = {x1,0, x2,0, …, xn,0} is the initial concentration vector at t0, and i, j ∈ {1, 2, …, n} are suffixes of state variables. Parameters αi and βi are non-negative rate constant. The real-valued gi,j and hi,j are referred as kinetic orders. An S-system model may include independent variables, which are typically known as prior information. Hence, independent variables are usually merged with the rate constants αi and βi [6]. The biochemical network are generally sparsely connected [31], when an interaction between two species exist, that is, jth specie directly activates (or inhibits) the ith specie, the parameter value corresponding to this interaction (gi,j, hi,j ) is non-zero, otherwise it is denoted as zero. In summary, the parameters of interest in a typical S-system are θ = {α, β, g, h}. Let the input u(t) = 0, fx0 (u) be the trajectory of the S-system (2) with initial state x0 and parameter set θ [9, 11]. Here, we evaluate the estimated model by considering whether it can reproduce all the trajectories generated by (2) or not based on the underlying x0 and θ. The definition of structural identifiability of S-systems is given as follows [21]. Definition 2: (Structural Identifiability of S-systems) An S-system (2) is a structurally (priori) identifiable for at least a generic set of point θ* ∈ Θ, there exist

fx0 (u) = fx0 (u∗ ),

Lemma 2: If Fi (x) = F˜ i (x), for i = 1, 2, …, n, then,

(gi,j − hi,j )(˜gi,j − hi,j )(h˜ i,j − hi,j )hm i,j Hi = 0,

(1)

Definition 1: (S-system) An S-system can be defined as follow. n 

where m = 0, 1, 2, … and i = 1, 2, …, n.

m Gi = 0, (hi,j − gi,j )(˜gi,j − gi,j )(h˜ i,j − gi,j )gi,j

where θ ∈ Rp is a vector of parameters associated with the state transition and control dynamics and the observation function, x(t) ∈ Rn is the state vector, u(t) ∈ Rq is the input vector, y(t) ∈ Rm is the vector of observed state components or functions and ω(t) ∈ Rm represents noise. Let fx0 (u, u) be the input-output mapping of the system (1) started at the initial state x0 with parameter θ. Very often, many biological systems possess special structure in the form of power law. In addition, a major advantage of S-system is that it uniquely maps dynamics and topology onto its parameters. Hence the general form (1) can be structuralised into the well-known S-system representation.

x˙ i = ai

(4)

(3)

has only one solution for all initial states x0 ∈ X⊆Rn. According to this definition, if an S-system is unidentifiable, (2) will have more than one solution [21]. However, in the following we will show that this is not the case. Before the discussion of the structural identifiability of S-system, two lemmas are needed.

˜ (h˜ i,j − g˜ i,j )(gi,j − g˜ i,j )(hi,j − g˜ i,j )˜gm i,j Gi = 0, (˜gi,j − h˜ i,j )(gi,j − h˜ i,j )(hi,j − h˜ i,j )h˜ i,j H˜ i = 0. m

Theorem 1: (Structural Identifiability Theorem of S-systems) For system (2) with output y(t) = x(t), this S-system is identifiable (A brief proof from [30] is reproduced for convenience in Appendix). Given a biological reaction network which can be modelled by S-system, with the identifiability Theorem 1, the parameter set θ is guaranteed to be unique in ideal cases of absence of noise with infinite and perfect measured data. Then, we can prune the interactions with parameters gi,j = 0 or hi,j = 0 and the topology of biological networks can be identified simultaneously. Such topology of biological reaction networks plays an important role in the analysis and understanding of biological process at system level [32]. 2.2

Practical identifiability

Refer to the S-system (2), given N pairs of input-output measurement, the cost function can be written in the sense of average weighted square prediction error f (uˆ , xˆ 0 ) =

M N −1  

(x(k) (tj ) − xˆ (k) (tj |uˆ ))T Wj (x(k) (tj ) − xˆ (k) (tj |uˆ )). (5)

k=1 j=0

Thus the identification problem can be formulated as a non-linear programming problem P0 : min f (uˆ , xˆ 0 ) uˆ ,ˆx0

s.t.

⎧ n g n h   ⎪ ⎪ xj i,j − bi xj i,j , ⎪ x˙ i = ai ⎨ j=1

j=1

x(t0 ) = xˆ 0 ,

⎪ xˆ (k) (tj |uˆ ) = x(tj |x(k) k = 1, . . . , M , ⎪ 0 ), ⎪ ⎩ ˆ uL ≤ u ≤ uU ,

where Wj are positive semi-definite weights, xˆ 0 is the estimated initial concentration, xˆ [ Rn are the estimated system states, x (k)(tj ) is the measured data of the kth experiment at time tj, uˆ = {ai , bi , gi,j , hi,j i, j = 1, 2, . . . , n} is the set of estimated parameters and xˆ (k) (tj |uˆ ) represents the estimated variable at time tj with parameter uˆ and initial condition xˆ 0 . θL and θU are simple structural constraints such as the parameter’s upper/lower bounds. The system (or the parameter θ) is practical identifiable if f (uˆ , xˆ 0 ) has a unique minimum, in other words there is a unique minimum prediction error estimated [20]. In this context, the identification procedure can be divided into three phases:

IET Syst. Biol., 2015, Vol. 9, Iss. 6, pp. 285–293

286

& The Institution of Engineering and Technology 2015

(i) the approximation phase, (ii) the estimation phase and (iii) the prune phase. † Approximation phase In this phase, spline interpolation is employed to estimate the trajectory and the derivative of the ODEs model and we remove the need for ODE solver during the identification process. This is conceptually similar to the decoupling approaches [9–11], but is solved via a non-linear constraint optimisation problem. † Estimation phase In this phase, we selected a modified DE algorithm to solve this problem. DE optimises a problem by randomly generating some initial seeds within the searching region. Then, DE maintains a population of candidate solutions and creating new candidate solutions by combining existing ones according to its rules of evolution. Candidate solution with the best score or fitness on the optimisation problem will be kept. Differ from the conventional DE algorithm, in this implementation, a set of weightings is introduced to control the evolution rate to improve the overall computational complexity of the searching process. A detailed description of this algorithm can be found in [18]. † Prune phase For S-system, if there is no interaction between two biochemical species, the system parameter value corresponding to the interaction (i.e. kinetic orders gi,j or hi,j ) is zero. As biochemical networks are often sparsely connected, a structural skeletonising procedure is commonly employed by setting some parameters of less than a given threshold to zero. The main idea is that if the magnitude of an interaction is below certain threshold, that is, |gi,j| ≤ δs (|hi,j| ≤ δs), the corresponding interaction is considered to be absent and removed [13, 33].

3 Analysing the structural and practical identifiability of yeast fermentation pathway model To verify the quality of the mentioned identifiablity, a nominal model with known structure and parameters is selected. The first part (Section 3.1) describes structural identification which would be used to compare with the practical identification given in Section 3.2. On the practical identification case, the over-complexity and the under-complexity problem are addressed. From these results, it shows that, in the identification case, by expanding the dataset which spans a relatively larger state space, the uncertainty of the estimated system can be reduced. One of the purposes of this case study is to demonstrate the importance of the size and the span of the dataset on the quality of the identification. Obviously, measurement noise will further deteriorate the identification accuracy. This issue was addressed in other previous works and therefore is omitted here [17, 18]. Here, we consider an S-system model of the ethanol production by yeast as a nominal model for analysing the structural and practical identifiability problems. The yeast fermentation pathway model is a biochemical networks and has been extensively studied [34, 35].

This model refers to anaerobic and non-growing conditions with glucose as the sole carbon source in the absence of nitrogen. This metabolic pathway involves five dependent variables, glucose (x1), frutose1, 60diphosphate (x3), glucose-6-phosphate (x2), phosphoenolpyruvate (x4) and ATP (x5), and eight independent variables with steady-state value, glucose uptake (x6), hexokinase (x7), phosphofructorkinase (x8), glyceraldehyde 3-phosphate dehydrogenase (x9), pyruvate kinase (x10), polysaccharide storage (x11), glycerol production (x12) and ATPase (x13). The ODE model of yeast fermentation pathway is g

h

h

x˙ 1 = a1 x21,2 x6 − b1 x11,1 x51,5 x7 , g

g

h

h

g

g

h

h

h

g

g

h

h

h

g

g

x˙ 2 = a2 x12,1 x52,5 x7 − b2 x22,2 x52,5 x0.8322 x0.1678 , 8 11 x˙ 3 = a3 x23,2 x53,5 x8 − b3 x33,3 x43,4 x53,5 x0.8547 x0.1453 , 9 12 x˙ 4 = a4 x34,3 x54,5 x9 − b4 x34,3 x44,4 x54,5 x10 , g

h

h

h

0.5 5,1 5,2 5,5 0.3514 0.2925 0.0589 0.297 x˙ 5 = a5 x35,3 x45,4 x55,5 x0.5 x8 x11 x13 , 9 x10 − b5 x1 x2 x5 x7

(6) where αi, βi, gi,j and hi,j are the unknown parameters and the nominal parameter values are shown in Table 1. We can evaluate the identified model by considering how well it can reproduce the given time-series data. Here, we use the mean absolute percentage error (MAPE) [36] as a criteria to evaluate identified model. MAPE is defined as following



















n  N 1  xi (tj ) − xˆ i (tj )

MAPE = , N · n i=1 j=0 xi (tj ) where x(ti) the measured data, xˆ (ti ) represents the estimated time-course produced by the identified model at time ti. Note that smaller MAPE reflects better estimation and means that the identified model can capture the measured data accurately. 3.1 Structure identifiability analysis of yeast fermentation pathway model A total of 12 sets of time course data are generated. The initial concentrations x0 is generated from the uniform distribution on the open interval [0, 1200]. Each initial condition is used to generate 60 time-course data. According to previous literature, the searching ranges are [0, 3] for αi and βi, [−1, 1] for gi, j and hi,j, respectively [6]. However, when αi and βi go to zero, all the parameters of Gi (x) would result unidentifiable. Hence, instead of using the search region [0, 3], in this paper, we use [e −11, e]. The mean of estimated parameters is listed in Table 1. The identified parameters are clearly close to the nominal values. Fig. 1a shows the nominal value (black dot), the mean of estimated parameter

Table 1 Identified parameters versus nominal parameters of the yeast fermentation pathway model i

αi

aˆ i

gi,1

gˆ i,1

gi,2

gˆ i,2

1 2 3 4 5 i 1 2 3 4 5 MAPE

1.0006 1.6497 0.4536 0.2365 1.406 βi 1.6497 0.5973 0.2456 2.0892 2.9437

0.9943 1.6402 0.4445 0.2284 1.3138 bˆ i 1.64 0.594 0.2401 2.0499 2.6532

0.0000 0.5582 0.0000 0.0000 0.0000 hi,1 0.5582 0.0000 0.0000 0.0000 0.1962

−0.0059 0.5592 −0.0004 −0.0009 −0.0091 hˆ i,1 0.5589 −0.0013 0 −0.0014 0.2063

−0.0492 0.0000 0.4407 0.0000 0.0000 hi,2 0.0000 0.5097 0.0000 0.0000 0.1791

−0.0479 0.0000 −0.0007 0.0000 0.4568 0.0000 0.0006 0.5285 −0.0091 0.2605 hi,3 hˆ i,2 0.0006 0.0000 0.5107 0.0000 −0.0027 0.4506 0.0012 −0.0075 0.1876 0.0000 0.0096 ± 0.0067

IET Syst. Biol., 2015, Vol. 9, Iss. 6, pp. 285–293 & The Institution of Engineering and Technology 2015

gi,3

gˆ i,3

gi,4

gˆ i,4

gi,5

gˆ i,5

−0.0005 0.0004 −0.0145 0.5369 0.2724 hˆ i,3 0 0.0001 0.4538 −0.0126 −0.0136

0.0000 0.0000 0.0000 0.0000 0.152 hi,4 0.0000 0.0000 0.0441 0.304 0.0000

−0.0011 −0.0003 −0.0015 −0.0043 0.1618 hˆ i,4 −0.0003 −0.0001 0.0442 0.3069 −0.0053

0.0000 0.0465 −0.2665 0.0994 0.0739 hi,5 0.0465 −0.2218 0.092 0.0484 0.2354

−0.0006 0.0469 −0.2665 0.0992 0.064 hˆ i,5 0.0465 −0.2222 0.0945 0.0464 0.2414

287

Fig. 1 Mean estimated parameter with variance: the black dots, cyan bars and red lines represent the nominal parameters, means of estimated parameters and the standard deviation of estimated parameters, respectively a The measurement data are sufficient and the identified model is accurate b The measurement data are insufficient and the identified parameters are clearly different from the nominal values

(cyan bar) and the standard deviation of estimated parameters (red line) in the format of error bar. Note that there are several weak connections, such as g1,2 = −0.0492, h4,3 = −0.0075, and h1,5 = −0.0484. Other connections, such as g5,3 = 0.2605, are almost 10 times stronger than these weak connections. The identification results show that for every disconnected pair of biochemical species (gi,j = 0 or hi,j = 0), the absolute mean of estimated gˆ i,j or hˆ i,j are all smaller than 0.014 (hˆ 5,3 = 0.0136). The less important connections with non-zero parameters are detected even though their order are around 10−2 (ˆg1,2 = 0.0479, hˆ 1,5 = 0.0465, hˆ 4,3 = −0.0126 and hˆ 4,5 = 0.0464). For the connected pairs of biochemical species (gi,j ≠ 0, or hi,j ≠ 0), almost all |gi,j| and |hi,j| are larger than 0.046. Note that these connections can be lost during the explicit skeletalising procedure, in which some kinetic orders near zero (or below a threshold) are set to zero. However, estimated results shown that, if the skeleton threshold is selected to be δs = 0.03, after pruning small parameters, only one connection (h4,3 = −0.0075) was incorrectly been removed. To provide a statistical view of the estimation performance, 1000 different time-courses were generated with a random initial region within [0, 1200]. Then, the mean and deviation of MAPE were calculated. The estimation result is summarised in Table 1. The MAPE is about 0.0096 ± 0.0067, namely, the difference between the output generated by the estimated model and the output generated by the prior known model is about 1% on average, which is minor. The average outputs and also the deviation between outputs, generated by the estimated model and nominal model, are calculated as following

95% confidence level. The two average trajectories are nearly identical to each other and the deviation region is small. Hence, we enlarge a part and show it in the inset of Fig. 2. From simulation and the structural identifiability theorem of S-systems, if a sufficient amount of noise free data are given, we have confidence that this identification approaches have an ability to identify a correct S-system model. 3.2 Practical identifiability of yeast fermentation pathway model In practical applications, some constraints must be considered: (i) the sets of time-series data could be obtained by actual biological experiments under different experimental conditions. However, the variation is limited in small region [6]. In the previous section, the initial condition x0 belongs to the region [0, 1200], which is very large;

⎧ M ⎪ 1 ⎪ ⎪ xj (t), ⎪ xi (t) = ⎪ ⎨ M j=1 i M ⎪ 1 ⎪ ⎪ ⎪  x˜ ji (t), ⎪ ⎩ xe,i (t) = M j=1

where xi (t), and xe,i (t) are the mean value of the ith variable of output generated by the nominal model and identified model at time t, respectively; Δxi (t) can be treated as the deviation; M is the repeated times and j = 1, 2, …, M. Here, M = 1000. The average trajectories of the nominal model xi (t) and the identified model xe,i (t) are given in Fig. 2, also with the confidence interval at a

Fig. 2 Dots represent the average trajectories generated by the nominal model, and solid lines represent the estimated profile generated by the identified model. The green region is the deviation region. The inset is a part of profiles of x5

IET Syst. Biol., 2015, Vol. 9, Iss. 6, pp. 285–293

288

& The Institution of Engineering and Technology 2015

Table 2 Over-complexity model (every node is connected) versus under-complexity model of the yeast fermentation pathway Over Complexity i

aˆ i

gˆ i,1

gˆ i,1

1 2 3 4 5

0.6654 1.4633 0.8917 0.1952 0.8633 bˆ i 0.9864 0.5203 0.5039 1.8815 1.2148

−0.1793 0.6221 −0.0207 −0.0098 −0.0852 hˆ i,1 0.7658 −0.0714 −0.0204 −0.0055 0.3084

−0.059 −0.0042 −0.0617 0.007 0.3477 0.0897 0.0119 0.5975 −0.0676 0.382 hˆ i,2 hˆ i,3 0.0248 −0.0039 0.5704 0.0088 0.1085 0.3327 0.0092 −0.0547 0.2968 −0.1381 0.0122 ± 1.7 × 10−5

1 2 3 4 5 MAPE

gˆ i,1

Under Complexity gˆ i,1

gˆ i,1

aˆ i

0.0051 −0.0033 0.0136 −0.0324 0.2219 hˆ i,4 0.0046 −0.0041 0.0381 0.3365 −0.0817

−0.0259 0.0809 −0.1979 0.097 −0.0016 hˆ i,5 0.0533 −0.2512 −0.0039 0.0369 0.3194

0.9877 0 1.6523 0.5573 0.4530 0 0.2520 0 1.3923 0 bˆ i hˆ i,1 1.6240 0.5788 0.5984 0 0.2277 0 2.1601 0 2.8986 0.1983 0.0167 ± 0.0021

(ii) due to practical limitations, usually only an insufficient amount of data and parts of the model variables are directly measurable, that is, not all species incorporated in a model can be measured directly [37]; (iii) experimental data may be perturbed by measurement error [35]. Furthermore, the degree-of-freedom of S-systems is high. A system that is structurally identifiable may still be practically non-identifiable, because there may exist multiple candidate solutions that can produce similar estimated trajectories. For instance, the resulting model may have a small MAPE (

Structural and practical identifiability analysis of S-system.

In the field of systems biology, biological reaction networks are usually modelled by ordinary differential equations. A sub-class, the S-systems repr...
564B Sizes 1 Downloads 7 Views