JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 22, Number 8, 2015
© Mary Ann Liebert, Inc.
Pp. 729–742
DOI: 10.1089/cmb.2014.0265

Reduced-Size Integer Linear Programming Models for String Selection Problems: Application to the Farthest String Problem

PETER ZÖRNIG

ABSTRACT

We present integer programming models for some variants of the farthest string problem. The number of variables and constraints is substantially less than that of the integer linear programming models known in the literature. Moreover, the solution of the linear programming relaxation contains only a small proportion of noninteger values, which considerably simplifies the rounding process. Numerical tests have shown excellent results, especially when a small set of long sequences is given.

Key words: computational biology, far from most strings problem, farthest string problem, mathematical programming, modeling.

1. INTRODUCTION

String selection and comparison problems have numerous applications, principally in computational biology, but also in coding theory, data compression, and quantitative linguistics. For instance, genomic and proteomic data can be modeled as sequences (strings) over the alphabets of nucleotides or amino acids; see, for example, Blazewicz et al. (2005), Boucher (2010), and Pappalardo et al. (2013). The formalization of problems like motif recognition and similar tasks leads to diverse combinatorial optimization problems like the closest (sub)string problem, the farthest (sub)string problem, the close to most strings problem, the far from most strings problem (FFMSP), and the distinguishing string selection problem; see, for example, Soleimani-damaneh (2011), Lanctot et al. (2003), Meneses et al. (2005), Festa (2007), Zörnig (2011), and Ferone et al. (2013).

In the present article we study some variants of the farthest string problem (FSP), which are generally NP-hard. The solution of these problems is very difficult, in particular in most biological applications where the sequences are very long; see Blazewicz et al. (2005) and Zörnig (2011, p. 3). FSPs are frequently modeled as (zero–one) integer linear programming (ILP) problems; see, for example, Lanctot et al. (2003), Meneses et al. (2005), and Festa and Pardalos (2012). Our main objective is to generalize the size reduction approach of Zörnig (2011), which has so far never been addressed in the literature, and apply it to the FSP. We show that, at least for a small set of sequences of arbitrary length, several variants of the FSP can be solved (exactly or with a very small error in the optimal value) by merely solving the linear programming (LP) relaxation of the ILP problem and subsequently rounding the noninteger solution values.

After introducing the necessary concepts, we model the FSP as an integer linear programming problem, considering two cases of

Department of Statistics, Institute of Exact Sciences, University of Brasília, Brasília, Brazil.


feasible sets (sec. 3.1 and sec. 3.2). The number of variables and constraints is substantially less than that in the ILPs presented so far in the literature. In sections 3.3 and 3.4 we consider two further variants of the FSP: in one of them the objective is to maximize the total sum of distances; the other is the FFMSP. By means of various test examples, we demonstrate that our proposed models can be easily solved. Some concluding remarks are given in section 4.

2. NOTATIONS AND BASIC CONCEPTS

Extending some earlier ideas (Zörnig, 2011), we provide the theoretical basis for reduced-size ILP models for string selection problems. Consider an alphabet O = {1, …, x} with x ∈ ℕ, whose elements are called characters. By O^m we denote the set of all sequences of length m over O. For any two sequences s, t ∈ O^m, the Hamming distance d(s, t) between s and t is defined as the number of positions in which s and t differ.

Many string selection problems are of the following general form: Let S = {s^1, …, s^n} be a set of n sequences with s^i = (s^i_1, …, s^i_m) ∈ O^m for i = 1, …, n. The problem is given by the string matrix

    S = ( s^1_1  …  s^1_m )
        (   ⋮          ⋮  )                                            (2.1)
        ( s^n_1  …  s^n_m )

whose rows consist of the sequences in S. Let V_j denote the set of elements appearing in the jth column of S. Furthermore, let v_j := |V_j| be the cardinality of V_j and V := ∪_{j=1}^m V_j the set of all characters appearing in the matrix S. The goal is to determine a string t = (t_1, …, t_m) that maximizes or minimizes a certain objective expressed in terms of Hamming distances between strings. The string t is usually constructed by choosing the element t_j from the set V_j; that is, the feasible solution set is X = V_1 × … × V_m. In particular, in the FSP the function f(t) := min_{i=1,…,n} d(s^i, t) is maximized over X.

We study some characteristics of the columns of Equation (2.1), which are n-dimensional vectors.

Definition 2.1:
(i) A column of Equation (2.1) is called complete if it contains all characters of the set V. Otherwise, the column is called incomplete.
(ii) The induced set partition of a vector a = (a_1, …, a_n) over the alphabet O is the partition of the set {1, …, n} whose parts correspond to the indices of identical values of a.
(iii) Two vectors a = (a_1, …, a_n) and b = (b_1, …, b_n) over O are called isomorphic if they induce the same set partition.

Clearly, the above vectors a and b are isomorphic if and only if a_i = a_j ⇔ b_i = b_j for any i, j ∈ {1, …, n}.

Example 2.1: Consider the vectors a = (1, 1, 2, 3, 2, 3, 1, 2, 2, 4), b = (2, 2, 4, 3, 4, 3, 2, 4, 4, 4), and c = (2, 3, 2, 3, 3, 1, 4, 4, 1, 2) of length n = 10 over the alphabet O = {1, …, 4}. The components a_1, a_2, and a_7 of a are identical (and all other a_i are different from these). Thus, we obtain the part {1, 2, 7} of the partition. In the same way we obtain the parts {3, 5, 8, 9}, {4, 6}, and {10}. Hence, the vector a induces the partition {1, …, 10} = {1, 2, 7} ∪ {3, 5, 8, 9} ∪ {4, 6} ∪ {10}. The same partition is obtained for the vector b, but the vector c induces another partition: {1, …, 10} = {6, 9} ∪ {1, 3, 10} ∪ {2, 4, 5} ∪ {7, 8}. Thus, a and b are isomorphic, while a and c are not.

We now define a unique representative for any isomorphism class of vectors.

Definition 2.2: Let {1, …, n} = C_1 ∪ … ∪ C_r be a partition into r parts, where the C_i are labeled such that min(C_1) < min(C_2) < … < min(C_r). Then, the representative vector of the partition is defined as the vector having the number i on the positions corresponding to the part C_i (i = 1, …, r).
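These definitions translate directly into code. The following Python sketch (my own illustration; the function and variable names are not from the article) computes the induced partition and the representative vector of a vector and uses them to test isomorphism. It reproduces the partitions of Example 2.1.

```python
def induced_partition(a):
    """Partition {1, ..., n} into parts of positions holding identical values,
    ordered by the minimum of each part (Definition 2.1 (ii))."""
    parts = {}
    for pos, value in enumerate(a, start=1):
        parts.setdefault(value, []).append(pos)
    return sorted(parts.values(), key=min)

def representative_vector(a):
    """Representative vector of the induced partition (Definitions 2.2/2.3):
    labels are assigned in order of first occurrence of each value."""
    label, rep = {}, []
    for value in a:
        if value not in label:
            label[value] = len(label) + 1
        rep.append(label[value])
    return tuple(rep)

def isomorphic(a, b):
    """Two vectors are isomorphic iff they induce the same set partition."""
    return representative_vector(a) == representative_vector(b)

if __name__ == "__main__":
    a = (1, 1, 2, 3, 2, 3, 1, 2, 2, 4)
    b = (2, 2, 4, 3, 4, 3, 2, 4, 4, 4)
    c = (2, 3, 2, 3, 3, 1, 4, 4, 1, 2)
    print(induced_partition(a))      # [[1, 2, 7], [3, 5, 8, 9], [4, 6], [10]]
    print(isomorphic(a, b))          # True
    print(isomorphic(a, c))          # False
    print(representative_vector(c))  # (1, 2, 1, 2, 2, 3, 4, 4, 3, 1)
```

The last line anticipates Example 2.2: the representative vector of c coincides with the representative vector of the partition induced by c.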


Example 2.2: Consider again the partition {1, …, 10} = {6, 9} ∪ {1, 3, 10} ∪ {2, 4, 5} ∪ {7, 8}. The minima of the parts are 6, 1, 2, and 7. Thus, the part with minimum 1, 2, 6, and 7 is denoted by C_1, C_2, C_3, and C_4, respectively, resulting in C_1 = {1, 3, 10}, C_2 = {2, 4, 5}, C_3 = {6, 9}, and C_4 = {7, 8}. The representative vector of the partition is therefore (1, 2, 1, 2, 2, 3, 4, 4, 3, 1).

We are now able to define a normalized form of a string problem.

Definition 2.3: The representative vector of a vector a = (a_1, …, a_n) over O is defined as the representative vector of the induced partition of a. Consider the string matrix S in Equation (2.1). The normalized matrix of S, denoted by T, is the matrix obtained from S by substituting each column by its representative vector. The numbers m_i of identical copies of representative vectors are called multiplicities. The matrix R, whose columns are the different representative vectors, is called the representative matrix.

Example 2.3: Consider the following matrix S with n = 3, m = 10, and x = 4:

    Position   1  2  3  4  5  6  7  8  9  10
    s^1        1  4  2  3  4  2  4  2  4   3
    s^2        3  3  1  3  2  4  1  2  1   3
    s^3        4  3  2  1  3  2  1  3  3   1

The corresponding normalized string matrix is

    Position   1  2  3  4  5  6  7  8  9  10
    t^1        1  1  1  1  1  1  1  1  1   1
    t^2        2  2  2  1  2  2  2  1  2   1
    t^3        3  2  1  2  3  1  2  2  3   2

By ordering the columns of T, we obtain

    Position       4  8  10   3  6   2  7   1  5  9
    t^1            1  1   1   1  1   1  1   1  1  1
    t^2            1  1   1   2  2   2  2   2  2  2
    t^3            2  2   2   1  1   2  2   3  3  3
    Column group   1  1   1   2  2   3  3   4  4  4

Thus, the representative matrix is

    R = ( 1  1  1  1 )
        ( 1  2  2  2 )
        ( 2  1  2  3 )

with multiplicities (m_1, …, m_4) = (3, 2, 2, 3).

In principle, another representative vector exists, corresponding to the trivial partition into one part, where all components are identical to 1. However, this vector occurs only when at least one column of the string matrix (2.1) consists of identical components. Such columns can be eliminated, since determining the corresponding elements of the searched sequence is trivial.

To any feasible solution (s_1, …, s_m) of the problem S, we assign biuniquely a sequence (t_1, …, t_m) of T as follows: If s_j is identical to the ith element of column j of S, we set t_j equal to the ith element of the jth column of the normalized matrix T. It can be easily verified that (t_1, …, t_m) is well defined. Formally, a biunique mapping from X = V_1 × … × V_m to the feasible solution set W = {1, …, v_1} × {1, …, v_2} × … × {1, …, v_m} is defined. For instance, the feasible solution (3, 3, 2, 3, 2, 4, 1, 2, 4, 3) of X in Example 2.3 corresponds to the feasible solution (2, 2, 1, 1, 2, 2, 2, 1, 1, 1) of W (compare the respective elements in S and T).
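The normalization step itself is mechanical and can be scripted. The sketch below (my own rough illustration; the column-major representation of the string matrix is an implementation choice, not the article's) builds the normalized columns, collects the distinct representative vectors, and counts their multiplicities for the matrix S of Example 2.3.

```python
from collections import Counter

def representative_vector(column):
    """Representative vector of a column (Definition 2.3)."""
    label, rep = {}, []
    for value in column:
        label.setdefault(value, len(label) + 1)
        rep.append(label[value])
    return tuple(rep)

def normalize(columns):
    """Return the normalized columns, the distinct representative vectors,
    and their multiplicities m_1, ..., m_k."""
    normalized = [representative_vector(col) for col in columns]
    counts = Counter(normalized)
    reps = sorted(counts)             # one fixed ordering of the k representatives
    mults = [counts[r] for r in reps]
    return normalized, reps, mults

if __name__ == "__main__":
    # rows of the matrix S from Example 2.3 (n = 3, m = 10)
    S_rows = [(1, 4, 2, 3, 4, 2, 4, 2, 4, 3),
              (3, 3, 1, 3, 2, 4, 1, 2, 1, 3),
              (4, 3, 2, 1, 3, 2, 1, 3, 3, 1)]
    S_cols = list(zip(*S_rows))
    T_cols, reps, mults = normalize(S_cols)
    print(reps)    # [(1, 1, 2), (1, 2, 1), (1, 2, 2), (1, 2, 3)]
    print(mults)   # [3, 2, 2, 3]
```

Since the mapping between S-columns and T-columns is position-wise, the multiplicities printed here are exactly the (m_1, …, m_4) = (3, 2, 2, 3) used in the models below.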


It is a crucial fact that this mapping preserves the Hamming distance; that is, the Hamming distance between two elements of X equals the Hamming distance between the corresponding elements of W. Thus, in modeling string problems one can generally work with the feasible solution space W instead of X. This may reduce the size of the ILP model significantly, since all vectors of an isomorphism class can be considered simultaneously in the formulation of the model (see sec. 3.1). In the remainder of this article, the modeling will be based on a normalized matrix T (except for sec. 3.3). Finally, we use the notations ⌈x⌉ and ⌊x⌋ for the smallest integer greater than or equal to x and the largest integer smaller than or equal to x, respectively.

3. SOLVING SOME VARIANTS OF THE FSP

In contrast to the closest string problem (CSP), it is not immediately obvious how the feasible solution set should be defined for the FSP. In the former case, one seeks a sequence that is as close as possible to all sequences of a set S. It is clear that the element x_j of a solution candidate x = (x_1, …, x_m) should be selected from the set V_j; that is, the feasible solution set is

    X = V_1 × … × V_m.                                                 (3.1)

The situation is different in the case of the FSP. Since one seeks a sequence as far as possible from all strings in S, it is in general not clear from which character sets the elements of a solution sequence should be chosen. This depends on the respective application. In the following, we study two cases of feasible solution sets X considered in the literature. In the first case, X is defined as in Equation (3.1); see, for example, Festa and Pardalos (2012, sec. 1.2). In the second case, the extended feasible solution set

    X = V^m                                                            (3.2)

is considered [see, e.g., condition (7) in the model of Meneses et al. (2005, sec. 4.2)].

3.1. Restricted feasible solution set

In analogy to the CSP model (5) in Zörnig (2011, p. 6), the FSP with feasible solution set (3.1) can be modeled by the integer linear programming problem

    max d
    s.t.  m − Σ_{j=1}^{k} y_{r_{i,j}, j} ≥ d    for i = 1, …, n,       (3.3)
          Σ_{i=1}^{v_j} y_{i,j} = m_j           for j = 1, …, k,
          d, y_{i,j} nonnegative integers.

The variable y_{i,j} represents the frequency of the character i in the positions of the solution sequence t = (t_1, …, t_m) that correspond to the jth representative vector. The parameters k, m_j, and v_j denote the number of such vectors, their multiplicities, and the number of different characters in the jth representative vector, respectively. The length of the sequences is m = m_1 + … + m_k. The left side of each inequality represents the Hamming distance between the ith row of T and the sequence encoded by the y_{i,j}. (Note that the first index r_{i,j} of the variables in the inequalities denotes the ith element of the jth representative vector.) The equations express the fact that the frequencies of the characters 1, 2, …, v_j in the jth isomorphism class sum up to m_j.

The practical utility of the above model becomes clear from the following proposition.

Definition 3.1: Let ȳ_{i,j} denote the optimal solution values of the LP-relaxation of problem (3.3) and f_{i,j} := ȳ_{i,j} − ⌊ȳ_{i,j}⌋ the fractional parts (0 ≤ f_{i,j} < 1 for i = 1, …, v_j, j = 1, …, k). We define F_j := Σ_{i=1}^{v_j} f_{i,j}, which is an integer from the set {0, …, v_j − 1}, since the equations of (3.3) are satisfied. A standard


rounding rule for problem (3.3) is defined as follows. Let j ∈ {1, …, k} be any fixed index. Select the F_j values ȳ_{i,j} with the largest fractional parts (which need not be uniquely determined) and round them up. Round down the remaining values ȳ_{i,j}. Obviously, this rounding procedure results in a feasible solution of the ILP problem (3.3), which will be called the standard rounding solution.

Proposition 3.1: Let d̄ be the optimal value of the LP-relaxation of problem (3.3). Then, the objective value d of a standard rounding solution satisfies d ≥ d̄ − k.

Proof. By substituting each value ȳ_{i,j} by its rounded value, each summand in the sums of the inequalities in problem (3.3) increases at most by 1, so each left-hand side decreases by at most k.

The result implies that for a small number k of representative vectors, the standard rounding solution is a very good approximate solution for the FSP when d is large in comparison with k (see Table 1).
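In practice the rounding step is a few lines of code. The following Python sketch (my own illustration, not code from the article) applies the standard rounding rule to the fractional LP values of one column group; ties among equal fractional parts may be broken arbitrarily.

```python
import math

def standard_rounding(y_group):
    """Standard rounding rule for one column group j of problem (3.3).

    y_group: the fractional values y_bar[1..v_j] of that group, which are
             assumed to sum to the integer multiplicity m_j.
    Returns integer values with the same sum: the F_j entries with the largest
    fractional parts are rounded up, the remaining ones are rounded down.
    """
    fractional = [y - math.floor(y) for y in y_group]
    F_j = round(sum(fractional))            # integer in {0, ..., v_j - 1}
    order = sorted(range(len(y_group)), key=lambda i: fractional[i], reverse=True)
    rounded = [math.floor(y) for y in y_group]
    for i in order[:F_j]:                   # round up the F_j largest fractions
        rounded[i] += 1
    return rounded

# Example: the fourth column group of problem (3.4) in Example 3.1 below,
# where (y_{1,4}, y_{2,4}, y_{3,4}) = (4/3, 4/3, 1/3) and m_4 = 3.
print(standard_rounding([4/3, 4/3, 1/3]))   # [2, 1, 0]; (1, 1, 1) is an equally
                                            # valid standard rounding (all ties)
```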

Table 1. Standard Rounding Solution for the FSP Model (3.3)

                           No.      m      N    d_relax       d_srs    Max. error
    n = 6, x = 2, k = 31    1    11,263    5     7534.833      7534        0
                            2    10,965    2     7310.5        7310        0
                            3    12,368    5     8076.667      8076        0
                            4    14,046    5     9228.833      9228        0
                            5     9290     4     6048          6047        1
                            6    10,858    4     7151.167      7151        0
                            7    10,306    5     6818.167      6818        0
                            8    13,434    4     8721.833      8721        0
                            9    15,943    5    10,337.67     10,337       0
                           10    18,821    5    12,193.17     12,192       1
    n = 8, x = 2, k = 127  11      737     3      473.5         473        0
                           12     1463     7      938.125       937        1
                           13     1122     7      758.6         757        1
                           14     2924     5     1875.875      1875        0
                           15     5903     5     3782.875      3782        0
                           16    11,861    6     7597.375      7596        1
                           17    23,929    6    15,335.12     15,334       1
                           18    47,465    5    30,424.88     30,424       0
                           19    36,907    2    24,858        24,857       1
                           20    35,923    3    24,290.5      24,290       0
    n = 6, x = 5, k = 15   21     2794     8     2328.333      2328        0
                           22     3305    10     2754.167      2754        0
                           23     4394     9     3515.5        3514        1
                           24     5487    12     4398.333      4398        0
                           25     6034    11     5068.833      5067        1
                           26     7856    10     6284.667      6282        2
                           27     8698     6     7393.333      7393        0
                           28     9482     8     7775.167      7773        2
                           29    10,670   10     8962          8961        1
                           30    11,355    8     9084.833      9084        0
    n = 8, x = 7, k = 28   31     5530    13     4838.75       4838        0
                           32    10,420   14     8336.625      8335        1
                           33    19,673   10    17,705.875    17,705       0
                           34    38,912   13    33,075.375    33,074       1
                           35    78,452    9    67,468.75     67,468       0
                           36   150,230    8   130,700.13    130,698       2
                           37   230,722   10   189,192.25    189,189       3
                           38   295,015   12   236,012.38    236,012       0
                           39   450,723    8   396,636.88    396,633       3
                           40   950,643   13   751,007.75    751,006       1


Example 3.1: For the problem in Example 2.3, it holds k = 4, v_1 = v_2 = v_3 = 2, and v_4 = 3. The ILP model (3.3) takes the form

    max d
    s.t.  10 − y_{1,1} − y_{1,2} − y_{1,3} − y_{1,4} ≥ d,
          10 − y_{1,1} − y_{2,2} − y_{2,3} − y_{2,4} ≥ d,
          10 − y_{2,1} − y_{1,2} − y_{2,3} − y_{3,4} ≥ d,
          y_{1,1} + y_{2,1} = m_1,                                     (3.4)
          y_{1,2} + y_{2,2} = m_2,
          y_{1,3} + y_{2,3} = m_3,
          y_{1,4} + y_{2,4} + y_{3,4} = m_4,
          d, y_{i,j} nonnegative integers,

and the multiplicities are (m_1, …, m_4) = (3, 2, 2, 3). The solution of the LP-relaxation is

    ȳ_{1,1} = 0,   ȳ_{1,2} = 0,   ȳ_{1,3} = 2,   ȳ_{1,4} = 4/3,
    ȳ_{2,1} = 3,   ȳ_{2,2} = 2,   ȳ_{2,3} = 0,   ȳ_{2,4} = 4/3,        (3.5)
                                                 ȳ_{3,4} = 1/3,

with optimal value d̄ = 20/3. One of the standard rounding solutions is therefore given by y_{1,4} = 1, y_{2,4} = 1, and y_{3,4} = 1 [where the other values in system (3.5) remain unchanged]. The corresponding objective value is 6 = ⌊20/3⌋; thus, an optimal solution of the integer problem (3.4) is encountered. The optimal solution corresponding to the string matrix T in Example 2.3 is therefore t̃ = (2, 2, 2 | 2, 2 | 1, 1 | 1, 2, 3), corresponding to the solution t = (1, 1, 2, 2, 2, 2, 1, 2, 3, 2) of T and to the solution s = (1, 4, 1, 1, 2, 4, 4, 3, 3, 1) of S.

Since problem (3.4) is a very specific case, we consider next the following somewhat more complex example.
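Before turning to that example, the LP-relaxation in Example 3.1 can be reproduced with any LP solver. The following sketch (an illustration under the assumption that SciPy is available; it is not code from the article) solves the relaxation of problem (3.4) with scipy.optimize.linprog. Only the fourth column group is fractional in solution (3.5), so the standard rounding rule then produces an integer solution with value 6.

```python
import numpy as np
from scipy.optimize import linprog

# Variable order: x = (d, y11, y21, y12, y22, y13, y23, y14, y24, y34)
c = np.array([-1.0] + [0.0] * 9)          # maximize d  <=>  minimize -d

# 10 - y_{r_{i,1},1} - ... - y_{r_{i,4},4} >= d, rewritten as "<=" rows
A_ub = np.array([
    [1, 1, 0, 1, 0, 1, 0, 1, 0, 0],       # inequality for t^1
    [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],       # inequality for t^2
    [1, 0, 1, 1, 0, 0, 1, 0, 0, 1],       # inequality for t^3
], dtype=float)
b_ub = np.array([10.0, 10.0, 10.0])       # m = 10

A_eq = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],       # y11 + y21 = m1 = 3
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],       # y12 + y22 = m2 = 2
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],       # y13 + y23 = m3 = 2
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],       # y14 + y24 + y34 = m4 = 3
], dtype=float)
b_eq = np.array([3.0, 2.0, 2.0, 3.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 10, method="highs")
print(res.status, -res.fun)               # 0  6.666...  (= 20/3)
print(np.round(res.x[1:], 4))             # some optimal y; may differ from (3.5)
                                          # if the LP optimum is not unique
```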

Example 3.2: Consider the problem (3.3) with 6 binary sequences (n = 6, x = 2). The representative matrix is

    R = ( r_{1,1}  …  r_{1,31} )
        (   ⋮             ⋮    )                                       (3.6)
        ( r_{6,1}  …  r_{6,31} )

where the columns 1–6,

    ( 1  1  1  1  1  1 )
    ( 2  2  1  1  1  1 )
    ( 2  1  2  1  1  1 )
    ( 2  1  1  2  1  1 )
    ( 2  1  1  1  2  1 )
    ( 2  1  1  1  1  2 ),

correspond to partitions into two parts of size 1 and 5, respectively; the columns 7–21,

    ( 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 )
    ( 1  2  2  2  2  2  2  2  2  1  1  1  1  1  1 )
    ( 2  1  2  2  2  2  1  1  1  2  2  2  1  1  1 )
    ( 2  2  1  2  2  1  2  1  1  2  1  1  2  2  1 )
    ( 2  2  2  1  2  1  1  2  1  1  2  1  2  1  2 )
    ( 2  2  2  2  1  1  1  1  2  1  1  2  1  2  2 ),

correspond to partitions into two parts of size 2 and 4, respectively; and the columns 22–31,

    ( 1  1  1  1  1  1  1  1  1  1 )
    ( 1  1  1  1  2  2  2  2  2  2 )
    ( 1  2  2  2  1  1  1  2  2  2 )
    ( 2  1  2  2  1  2  2  1  1  2 )
    ( 2  2  1  2  2  1  2  1  2  1 )
    ( 2  2  2  1  2  2  1  2  1  1 ),

correspond to partitions into two parts of size 3. Problem (3.3) now has the form

    max d
    s.t.  y_{1,1} + y_{1,2} + y_{1,3} + … + y_{1,31} + d ≤ m,
          y_{2,1} + y_{2,2} + y_{1,3} + … + y_{2,31} + d ≤ m,
          y_{2,1} + y_{1,2} + y_{2,3} + … + y_{2,31} + d ≤ m,
          y_{2,1} + y_{1,2} + y_{1,3} + … + y_{2,31} + d ≤ m,          (3.7)
          y_{2,1} + y_{1,2} + y_{1,3} + … + y_{1,31} + d ≤ m,
          y_{2,1} + y_{1,2} + y_{1,3} + … + y_{1,31} + d ≤ m,
          y_{1,j} + y_{2,j} = m_j    for j = 1, …, 31,
          d, y_{i,j} nonnegative integers,

where m = m_1 + … + m_31 and k = S(6, 2) = 31 [S(n, k) denotes the Stirling number of the second kind], and the first index of the y_{i,j} is the corresponding element of the matrix (3.6). In the present binary case, one can reduce the number of variables further by substituting m_j − y_{1,j} for y_{2,j} and setting y_j := y_{1,j}. One can verify that this yields the problem

    max d
    s.t.  Q (y_1, …, y_31)^T ≤ (M_1 − d, …, M_6 − d)^T,                (3.8)
          y_j ≤ m_j    for j = 1, …, 31,
          d, y_j nonnegative integers,

where the 6 × 31 matrix Q = (q_{i,j}) has entries +1 and −1 such that q_{i,j} = 1 if r_{i,j} = 1 and q_{i,j} = −1 if r_{i,j} = 2 (i = 1, …, 6; j = 1, …, 31). The constants M_i are defined by M_i = Σ_{j: q_{i,j} = 1} m_j for i = 1, …, 6.

Consider, for example, the randomly generated parameter values (m_1, …, m_31) = (633, 291, 813, 91, 889, 501, 375, 265, 218, 599, 326, 259, 128, 572, 515, 295, 359, 195, 459, 315, 215, 651, 143, 141, 378, 194, 604, 215, 230, 148, 246), summing up to m = 11,263. The LP-relaxation has the solution (y_10, y_14, y_22, y_24, y_25, y_27, y_29, y_31) = (41.16667, 14.33333, 290.1667, 141, 250.1667, 604, 79.66667, 246) (where only nonzero values are listed), with optimal value 7534.833. The corresponding standard rounding solution (y_10, y_14, y_22, y_24, y_25, y_27, y_29, y_31) = (41, 14, 290, 141, 250, 604, 80, 246) has the objective value 7534 and is therefore optimal for problem (3.8). Note that a nontrivial FSP with 6 ''long'' sequences has been solved without needing any technique to ensure the integrality requirements.

The model (3.3) has 1 + Σ_{j=1}^{k} v_j variables and k + n linear constraints, independent of the length m of the sequences. Therefore, its size is generally much smaller than that of the conventional models of Festa and Pardalos (2012, sec. 1.2) and Meneses et al. (2005, sec. 4.2), where the size increases linearly with the length of the sequences. In particular, for an FSP with 3 sequences [see problem (3.4)], we get k = 4,


v_1 = v_2 = v_3 = 2, v_4 = 3, and the model has 10 variables and 7 linear restrictions, whatever the length of the sequences may be. For example, the standard rounding solution is the exact solution of problem (3.4) for m_1 = … = m_4 = 500, corresponding to the sequence length m = 2000. But the just-mentioned models would require several thousands of restrictions and binary variables to solve an FSP with 3 sequences of that length.

If some possible representative vectors do not occur in the normalized problem, the model size can be reduced even further. For example, if the third column group in problem (3.4) is empty, it follows that m_3 = 0. Thus, y_{1,3} = y_{2,3} = 0; that is, one can eliminate these variables and the third equation from problem (3.4).

We now determine the standard rounding solution for test examples with randomly generated multiplicities. It turns out that the corresponding objective value is always very close to the optimal value of the FSP. In fact, the maximal possible error observed in the 40 test examples of Table 1 is 3. The standard rounding solution can be easily determined, since always only a small fraction of the optimal values of the LP-relaxation is noninteger. For each test instance the following information is provided:

    n:          number of sequences
    x:          alphabet size
    k:          number of representative vectors
    m:          sequence length
    N:          number of noninteger solution values
    d_relax:    optimal value of the LP-relaxation
    d_srs:      objective value of the standard rounding solution
    Max. error: ⌊d_relax⌋ − d_srs

From the examples in Table 1 it becomes clear that the FSP for a sequence length up to about 1 million can be easily solved with model (3.3) for small values of n and x. The actually observed objective value of the standard rounding solution is usually much better than the lower limit given by Proposition 3.1. This is an encouraging result, since finding an approximate solution for string selection problems is considered a very difficult task; see Boucher et al. (2012, p. 1).

In 22 of 40 cases the standard rounding solution is in fact the exact solution of the FSP. In the remaining cases the maximum possible error is 3. In these cases the verification of optimality by means of a branch-and-bound method may require a high computational effort. In particular, the integer solver of LINGO could not solve the ILP for example no. 10 of Table 1 in about 15 million iterations. However, an absolute error of only 3 is surely negligible for practical applications with very long sequences.
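The values of k used above coincide with Stirling numbers of the second kind: k = S(n, x) counts the representative vectors with exactly x distinct labels, i.e., the column types of complete columns (31, 127, 15, and 28 in Table 1, and the k column of Table 2 below). As a small cross-check, the following Python sketch (my own illustration, not from the article) enumerates these vectors directly as restricted growth strings.

```python
def representative_vectors(n, x):
    """Enumerate representative vectors of length n with exactly x distinct
    labels (restricted growth strings over labels 1..x in which every label
    occurs); their number is the Stirling number S(n, x)."""
    def extend(prefix, used):
        if len(prefix) == n:
            if used == x:
                yield tuple(prefix)
            return
        # the next position may reuse any label seen so far or open label used+1
        for label in range(1, min(used + 1, x) + 1):
            yield from extend(prefix + [label], max(used, label))
    return list(extend([1], 1))

for n, x in [(6, 2), (8, 2), (6, 5), (8, 7), (4, 3), (7, 6)]:
    print(n, x, len(representative_vectors(n, x)))
# prints the counts 31, 127, 15, 28, 6, 21, i.e. S(n, x)
```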

3.2. Extended feasible solution set

We now assume that the feasible solution set is given by Equation (3.2); that is, in determining the farthest string, any position can be occupied by any element of the set V. We assume that V = O; that is, all characters appear at least once in the string matrix (2.1). Otherwise, there exists an x_0 ∈ O\V such that (x_0, …, x_0) ∈ O^m is an optimal solution of the FSP. Furthermore, we assume that the first r columns of the matrix (2.1) are complete, while the others are incomplete. In the construction of an optimal sequence t = (t_1, …, t_m), one can set

    t_i = x_i    for i = r + 1, …, m,                                  (3.9)

where x_i ∈ V\V_i is a character not occurring in the ith column of matrix (2.1). Thus, the incomplete columns can be exempted from further consideration. The problem simplifies considerably when the proportion of incomplete columns is large (see sec. 4). In particular, any FSP with x > n is trivial, since all columns of matrix (2.1) are incomplete in this case and an optimal solution is given by Equation (3.9). This observation might be interesting for problems with sequences over a large alphabet, which may degrade the performance of a string selection algorithm considerably; see Kuksa and Pavlovic (2009).

We now solve the FSP for x ≤ n. Let S be the string matrix of an FSP after deletion of incomplete columns. Then, there exist k = S(n, x) representative vectors, corresponding to the partitions of the set {1, …, n} into x parts. We now obtain the model


    max d
    s.t.  m − Σ_{j=1}^{k} y_{r_{i,j}, j} ≥ d    for i = 1, …, n,       (3.10)
          Σ_{i=1}^{x} y_{i,j} = m_j             for j = 1, …, k,
          d, y_{i,j} nonnegative integers,

which is a special case of problem (3.3). The numbers of variables and linear constraints are now given by 1 + kx and n + k, respectively. These numbers are calculated in Table 2 for some values of n and x.

Table 2. Size of the Model (3.10)

    n    x     k    Variables    Linear constr.
    4    3     6        19            10
    5    3    25        76            30
    6    4    65       261            71
    6    3    90       271            96
    7    6    21       127            28

The case x = n is of particular interest, and then the standard rounding solution is always optimal. Assume that incomplete columns have been removed from the matrix (2.1); that is, all remaining columns are permutations of the column vector (1, …, n). Thus, all these columns are isomorphic and one obtains the unique representative vector (1, …, n). Problem (3.10) now takes the form

    max d
    s.t.  m − y_1 ≥ d,
          ⋮                                                            (3.11)
          m − y_n ≥ d,
          y_1 + … + y_n = m,
          d, y_i nonnegative integers,

where the variables y_{i,1} have been renamed y_i.

Proposition 3.2: An optimal solution of problem (3.11) is given by

    y_1 = … = y_n = m/n                                                (3.12)

if m/n is an integer. Otherwise, the standard rounding solution is optimal.

Proof. Evidently, the LP-relaxation of problem (3.11) has the optimal solution (3.12) with corresponding optimal value d̄ = m − m/n. If m/n is not an integer, the standard rounding solution has the objective value m − ⌈m/n⌉ = ⌊m − m/n⌋.

Example 3.3: Let

    S = ( 1  3  1  3  1  2  1  3  4 )
        ( 2  4  3  4  2  1  2  2  3 )                                  (3.13)
        ( 3  1  2  2  3  4  4  1  2 )
        ( 4  2  4  1  4  3  3  4  1 )


be the string matrix of an FSP (after deletion of incomplete columns) with n = 4 and m = 9. The normalized problem has the matrix

    T = ( 1  …  1 )
        ( 2  …  2 )                                                    (3.14)
        ( 3  …  3 )
        ( 4  …  4 )

with 9 identical columns. The LP solution of problem (3.11) is ȳ_1 = … = ȳ_4 = 2.25, yielding the standard rounding solution y_1 = y_2 = y_3 = 2 and y_4 = 3. Thus, an optimal solution t = (t_1, …, t_9) of the FSP given by Equation (3.14) contains the numbers 1, 2, 3, and 4 with frequencies 2, 2, 2, and 3, respectively. In particular, t = (1, 1, 2, 2, 3, 3, 4, 4, 4) is an optimal solution, which corresponds to the solution (1, 3, 3, 4, 3, 4, 3, 4, 1) of matrix (3.13).
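For the extended feasible set, the preprocessing of incomplete columns (Equation 3.9) and the x = n case of Proposition 3.2 are simple enough to script directly. The following Python sketch (my own illustration, not the article's code; it only covers these two easy cases, not the general model 3.10) reproduces the optimal value 6 for the matrix (3.13) of Example 3.3.

```python
def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def farthest_string_extended(rows, alphabet):
    """FSP over the extended feasible set X = V^m, restricted to the cases of
    sec. 3.2: incomplete columns receive a missing character (Eq. 3.9); the
    remaining columns are assumed complete with x = n, so each of the n rows
    is matched in either floor or ceil(m'/n) of them (Proposition 3.2)."""
    n = len(rows)
    cols = list(zip(*rows))
    t = [None] * len(cols)
    complete = []
    for j, col in enumerate(cols):
        missing = sorted(set(alphabet) - set(col))
        if missing:                        # incomplete column: hits no row at all
            t[j] = missing[0]
        else:
            complete.append(j)
    for idx, j in enumerate(complete):     # balanced assignment over the rows:
        i = idx % n                        # counts differ by at most 1, which is
        t[j] = cols[j][i]                  # exactly the standard rounding split
    return tuple(t)

if __name__ == "__main__":
    S = [(1, 3, 1, 3, 1, 2, 1, 3, 4),
         (2, 4, 3, 4, 2, 1, 2, 2, 3),
         (3, 1, 2, 2, 3, 4, 4, 1, 2),
         (4, 2, 4, 1, 4, 3, 3, 4, 1)]
    t = farthest_string_extended(S, alphabet={1, 2, 3, 4})
    print(t, min(hamming(s, t) for s in S))   # minimum distance 6 = 9 - ceil(9/4)
```

The returned string may differ from the one quoted in Example 3.3, since any assignment with the same frequencies 2, 2, 2, 3 is optimal.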

3.3. Maximizing the sum of distances

We now consider another variant of the FSP, where the objective is to maximize the sum of distances instead of the minimal distance. This yields the problem

    max Σ_{i=1}^{n} d(s^i, t),                                         (3.15)

where t is chosen from X given by Equation (3.1) or (3.2). This problem, studied in Cheng et al. (2004, sec. 3), is incomparably simpler than the maximin problem considered so far. Because of its simplicity, no normalization is necessary. The problem (3.15) can be decomposed into subproblems as follows. For given sequences s^i = (s^i_1, …, s^i_m) and t = (t_1, …, t_m), we define d_i(j) = 0 if s^i_j = t_j and d_i(j) = 1 otherwise. The Hamming distance between s^i and t can then be written as

    d(s^i, t) = Σ_{j=1}^{m} d_i(j),                                    (3.16)

and the sum (3.15) takes the form

    max Σ_{i=1}^{n} Σ_{j=1}^{m} d_i(j) = max Σ_{j=1}^{m} Σ_{i=1}^{n} d_i(j).   (3.17)

Now the expressions Σ_{i=1}^{n} d_i(j) can be maximized independently from each other, and the overall solution sequence of Equation (3.17) is composed of these individual solutions. Obviously it holds that

    Σ_{i=1}^{n} d_i(j) = n − f(t_j),                                   (3.18)

where f(t_j) denotes the frequency of occurrence of the character t_j in the jth column of the string matrix S. Thus, Equation (3.18) is maximized by setting t_j equal to one of the rarest elements in the jth column of S. In particular, if this column is incomplete, one sets t_j equal to a character that does not occur in this column, corresponding to f(t_j) = 0. It is now clear that the FSP (3.15) can be solved easily in time linear in the size of the string matrix.

Example 3.4: Let

    S = ( 1  1  2  1  3  2 )
        ( 2  4  3  1  2  4 )
        ( 3  2  1  3  3  4 )                                           (3.19)
        ( 3  1  2  2  4  4 )
        ( 3  4  4  3  4  3 )

be the string matrix of an FSP with x = 4. If the feasible solution set is defined by Equation (3.1), an optimal solution of the FSP is given by (1, 2, 1, 2, 2, 3). If the feasible set is defined by Equation (3.2), the sequence (4, 3, 4, 4, 1, 1) represents an optimal solution.
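This column-by-column rule amounts to one line per position. The sketch below (my own illustration) implements it for both feasible sets; because several characters can be equally rare in a column, its output may differ from the solutions quoted in Example 3.4 in tied positions while attaining the same total sum of distances.

```python
from collections import Counter

def max_sum_of_distances(rows, alphabet, extended=False):
    """Maximize the sum of Hamming distances (problem (3.15)) column by column:
    pick a character of minimum frequency in each column; with the extended
    feasible set (3.2), a character absent from the column (frequency 0) wins."""
    cols = list(zip(*rows))
    t = []
    for col in cols:
        freq = Counter(col)
        candidates = alphabet if extended else set(col)
        t.append(min(candidates, key=lambda ch: (freq[ch], ch)))
    return tuple(t)

if __name__ == "__main__":
    S = [(1, 1, 2, 1, 3, 2),
         (2, 4, 3, 1, 2, 4),
         (3, 2, 1, 3, 3, 4),
         (3, 1, 2, 2, 4, 4),
         (3, 4, 4, 3, 4, 3)]
    A = {1, 2, 3, 4}
    print(max_sum_of_distances(S, A))                  # (1, 2, 1, 2, 2, 2): ties with
                                                       # the article's (1, 2, 1, 2, 2, 3)
    print(max_sum_of_distances(S, A, extended=True))   # (4, 3, 1, 4, 1, 1): ties with
                                                       # the article's (4, 3, 4, 4, 1, 1)
```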


3.4. Far from most strings problem

We now consider one more variant of the FSP, called the FFMSP: Given are a set S = {s^1, …, s^n} of n sequences with s^i = (s^i_1, …, s^i_m) ∈ O^m for i = 1, …, n and a threshold d_0 > 0. The goal is to find a sequence t from a feasible set X maximizing the number of strings s^i in S such that d(s^i, t) ≥ d_0. The set X can be defined by Equation (3.1) or (3.2), but we will restrict ourselves to the second case.

By modifying problem (3.10) we obtain the following model for the FFMSP. For i = 1, …, n we introduce the binary variable z_i, which equals 1 if the string s^i belongs to the ''active'' strings, that is, the strings that are considered in the distance maximization [otherwise z_i = 0; see, e.g., Meneses et al. (2005, p. 4) for a similar reasoning]. We get

    max z_1 + … + z_n
    s.t.  m − z_i Σ_{j=1}^{k} y_{r_{i,j}, j} ≥ d_0    for i = 1, …, n,     (3.20)
          Σ_{i=1}^{x} y_{i,j} = m_j                   for j = 1, …, k,
          y_{i,j} nonnegative integers, z_i ∈ {0, 1}.

The optimal value of problem (3.20) is the maximum number of strings having distance at least d_0 from t.

Example 3.5: Consider the case n = 4 and x = 3. The model (3.20) takes the form

    max z_1 + z_2 + z_3 + z_4
    s.t.  m − z_1 (y_{1,1} + y_{1,2} + y_{1,3} + y_{1,4} + y_{1,5} + y_{1,6}) ≥ d_0,
          m − z_2 (y_{1,1} + y_{2,2} + y_{2,3} + y_{2,4} + y_{2,5} + y_{2,6}) ≥ d_0,
          m − z_3 (y_{2,1} + y_{1,2} + y_{3,3} + y_{2,4} + y_{3,5} + y_{3,6}) ≥ d_0,     (3.21)
          m − z_4 (y_{3,1} + y_{3,2} + y_{1,3} + y_{3,4} + y_{2,5} + y_{3,6}) ≥ d_0,
          y_{1,j} + y_{2,j} + y_{3,j} = m_j    for j = 1, …, 6,
          y_{i,j} nonnegative integers, z_i ∈ {0, 1}    for i = 1, …, 4.

In order to solve the nonlinear integer programming problem (3.20), we proceed as follows. In a first phase, the relaxation obtained by omitting the integrality conditions is solved. In a second phase, we solve the modification of problem (3.20) obtained as follows: we substitute the condition z_i ∈ {0, 1} by z_i = 0 if z_i was smaller than 1 in the first-phase solution, and by z_i = 1 otherwise (i = 1, …, n).

In order to solve the nonlinear integer programming problem (3.20), we proceeded as follows: In a first phase, the relaxation is solved, obtained by omitting the integer conditions. In a second phase, we solve the modification of problem (3.20), obtained as follows: we substitute the condition zi 2 f0‚ 1g by zi = 0, if zi was smaller than 1 in the first phase solution, and by zi = 1 otherwise (i = 1‚ . . . ‚ n). Example 3.6: Assume that the multiplicities in problem (3.21) are (m1 ‚ . . . ‚ m6 ) = (796, 974, 397, 998, 789, 999), implying m = 4953. For the arbitrarily chosen threshold d0 = 4000, the relaxation of problem (3.21) has the solution (z1 ‚ z2 ‚ z3 ‚ z4 ) = (0:46‚ 1‚ 1‚ 1)‚ 0

y1‚ 1 B .. @ . y3‚ 1

1 0    y1‚ 6 0 0 .. C = @ 739:85 382:29 . A 56:15 591:71    y3‚ 6

0 397 0

692:68 575:85 0 0 305:14 213:15

(3:22) 1 825:29 173:71 A 0

¨ RNIG ZO

740 Table 3. Test Problems for the FFMSP Model (3.20) Problem dimensions No.

d0

m

First-phase Second-phase optimal value optimal value Global optimality

n=4 x=3

1 2 3 4 5 6 7 8 9 10

3000 4000 4000 5300 5500 5000 6900 5500 8000 8000

3716 4465 4953 5415 5831 6362 6916 7747 8488 9093

3.4566 2.6663 3.4551 2.0838 2.2559 3.5984 2.0082 4 2.2567 2.7218

3 2 3 2 2 3 2 4 2 2

Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

n=6 x=4

11 12 13 14 15 16 17 18 19 20

600 1200 1250 4000 8000 15,000 30,000 30,935 55,000 60,000

627 1293 1293 4113 8059 15,735 30,936 30,936 60,963 60,963

5.4909 5.1123 4.2294 4.1795 4.0382 4.5307 4.2239 4.0002 5.1899 4.0889

5 5 4 4 4 4 4 4 5 4

Yes Yes Yes Yes No Yes Yes Yes Yes Yes

n=8 x=7

21 22 23 24 25 26 27 28 29 30

6200 6300 8000 13,000 13,300 19,000 20,000 37,500 9000 11,000

6411 6411 9142 13,437 13,437 20,227 20,227 38,827 9567 11,845

7.0428 6.4807 7.9948 7.0421 6.1057 7.1054 6.1203 7.0449 7.1013 7.1425

7 6 7 7 6 7 6 7 7 7

Yes No Yes Yes Yes Yes No Yes Yes Yes

with corresponding optimal value 3.46. We modify problem (3.21) by substituting the conditions zi 2 f0‚ 1g for i = 1‚ . . . ‚ 4 by z1 = 0, z2 = 1, z3 = 1, and z4 = 1. The resulting linear program has the solution 0

y1‚ 1 B .. @ . y3‚ 1

 

z1 = 0‚ z2 = 1‚ z3 = 1‚ z4 = 1‚ 1 0 y1‚ 6 240 0 0 998 .. C = @ 556 21 0 0 . A 0 953 397 0 y3‚ 6

(3:23) 789 0 0

1 999 0 A 0

with objective value 3.We emphasize that the integer values of the yi,j occurred ‘‘automatically’’ by solving the linear problem (no measures were taken to assure integrality); Equation (3.23) also solves the integer problem (3.21) since the first phase solution was optimal for the relaxation. In Table 3 the above simple heuristic has been applied to 30 test problems with different numbers n, m, and x. The multiplicities mj in problem (3.20) have been chosen at random and the threshold d0 has been determined arbitrarily. Since the relaxation of problem (3.20) is not a convex optimization problem, the solution may be only a local optimum, depending on the starting values and solution procedure applied by the used software. In our test problems performed by the software LINGO, global optimality was achieved in 90% of the instances. It is very interesting to observe that in all test runs of Table 3 the second-phase solution has exclusively integer values. As this table shows, the objective value of the second phase was always ºzß, where z denotes the objective value of the first phase. Thus, when the first phase terminated with a global optimum, the second phase provided a global optimal solution for the FFMSP (3.20). This is an


encouraging result, since the FFMSP is considered one of the most difficult sequence consensus problems; see, for example, Ferone et al. (2013, p. 2).
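To make the second phase of the heuristic concrete: once the z_i are fixed from the first phase, the bilinear constraints of (3.20) become linear, and what remains is a feasibility check for the active strings. The following sketch (my own illustration, assuming SciPy is available; it is not code from the article) sets this up for the data of Example 3.6. The solver returns some feasible y, which need not coincide with (3.23) and need not be integral.

```python
import numpy as np
from scipy.optimize import linprog

# Representative matrix for n = 4, x = 3 (the six columns used in Example 3.5),
# multiplicities and threshold from Example 3.6.
R = np.array([[1, 1, 1, 1, 1, 1],
              [1, 2, 2, 2, 2, 2],
              [2, 1, 3, 2, 3, 3],
              [3, 3, 1, 3, 2, 3]])
m_j = np.array([796, 974, 397, 998, 789, 999])
m, d0 = m_j.sum(), 4000
active = [1, 2, 3]              # 0-based rows with z_i = 1 after phase 1 (z_1 = 0)

x_levels, k = 3, R.shape[1]     # variable y[i][j] is stored at index i*k + j
A_eq = np.zeros((k, x_levels * k))
for j in range(k):
    for i in range(x_levels):
        A_eq[j, i * k + j] = 1.0              # sum_i y_{i,j} = m_j
A_ub = np.zeros((len(active), x_levels * k))
for row, i in enumerate(active):
    for j in range(k):
        A_ub[row, (R[i, j] - 1) * k + j] = 1.0   # agreements with string i <= m - d0
b_ub = np.full(len(active), m - d0, dtype=float)

res = linprog(np.zeros(x_levels * k), A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=m_j.astype(float),
              bounds=[(0, None)] * (x_levels * k), method="highs")
print(res.status)                    # 0: the fixed choice z = (0, 1, 1, 1) is feasible
print(res.x.reshape(x_levels, k))    # one feasible y; (3.23) is another such solution
```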

4. PROBABILISTIC CONSIDERATIONS AND CONCLUDING REMARKS

The above elementary solution techniques work very well for a small number n of sequences. It is of course necessary to perform further tests for FSPs with a larger set of sequences. The number of possible representative vectors increases with n (see, e.g., Table 2); however, not all these vectors need to occur in a normalized problem (see sec. 3.1). It would be interesting to study the number k of representative vectors and their multiplicities occurring in practical applications. Assuming that the characters are randomly chosen from the alphabet, one can estimate k by means of simulations. It is intuitively clear that columns with a small number of different characters are rare in the random case. In particular, one can determine the probability that a column is incomplete.

Proposition 4.1: Assume that the n × m matrix (2.1) is randomly constructed such that every character of O = {1, …, x} appears with equal probability. Then, a specific column of S is incomplete with probability

    P(n, x) = Σ_{i=1}^{x−1} (−1)^{i−1} C(x, i) (1 − i/x)^n,

where C(x, i) denotes the binomial coefficient.

Proof. Let A_i denote the event that a given column does not contain the character i. Then, from the inclusion–exclusion principle it follows that

    P(n, x) = P(A_1 ∪ … ∪ A_x)
            = Σ_{i=1}^{x} P(A_i) − Σ_{1≤i<j≤x} P(A_i ∩ A_j) + … + (−1)^{x−1} P(A_1 ∩ … ∩ A_x)
            = x ((x−1)/x)^n − C(x, 2) ((x−2)/x)^n + … + (−1)^{x−2} x (1/x)^n,

implying the statement.

Consider, for example, an application with n = 40 protein sequences composed of x = 20 characters corresponding to amino acids. Assuming that all characters appear with the same probability, a column of the string matrix (2.1) is then incomplete with probability P(40, 20) ≈ 0.964; that is, more than 96% of all columns are expected to be incomplete and hence can be exempted from further consideration (see sec. 3.2). This fact simplifies the solution considerably.

It is also necessary to carry out further test runs with model (3.20) for the FFMSP. In particular, it must be checked under which circumstances the solution values are integers in the respective two phases. There also arise some theoretical questions regarding the main model (3.3). Among others, it should be investigated how parameter changes influence the solution. In particular, one can observe the following ''linearity condition'': if Y = (y_{i,j}) solves the LP-relaxation of problem (3.3) for the parameters m_1, …, m_k, then cY is the ''relaxed'' solution for the parameters cm_1, …, cm_k. It would also be interesting to derive a better lower bound for the objective value of the standard rounding solution than that provided in Proposition 3.1. Such a bound is closely related to the number N of noninteger values in the relaxed solution. In this context it is interesting to observe that the coefficients of the variables y_{i,j} in the model (3.3) are ±1. It might be worthwhile to examine whether the low number of noninteger values can be explained by a generalized unimodularity concept; see, for example, Kotnyek (2002).
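The formula of Proposition 4.1 is easy to evaluate numerically. The short Python sketch below (my own illustration) computes P(n, x) by inclusion–exclusion and reproduces the value P(40, 20) ≈ 0.964 quoted above.

```python
from math import comb

def incomplete_probability(n, x):
    """Probability that a random column of length n over an alphabet of size x
    misses at least one character (Proposition 4.1, inclusion-exclusion)."""
    return sum((-1) ** (i - 1) * comb(x, i) * (1 - i / x) ** n
               for i in range(1, x))

print(round(incomplete_probability(40, 20), 3))   # 0.964  (example in the text)
print(round(incomplete_probability(6, 4), 3))     # 0.619: even short columns over
                                                  # a small alphabet are often incomplete
```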

AUTHOR DISCLOSURE STATEMENT

No competing financial interests exist.


REFERENCES

Blazewicz, J., Formanowicz, P., and Kasprzak, M. 2005. Selected combinatorial problems of computational biology. Eur. J. Oper. Res. 161, 585–597.
Boucher, C.A. 2010. Combinatorial and probabilistic approaches to motif recognition [PhD dissertation]. University of Waterloo, Waterloo, Ontario, Canada.
Boucher, C.A., Landau, G.M., Levy, A., and Pritchard, D. 2012. On approximating string selection problems with outliers. http://arxiv.org/pdf/1202.2820.pdf
Cheng, C.H., Huang, C.C., Hu, S.Y., and Chao, K.M. 2004. Efficient algorithms for some variants of the farthest string problem. In Proceedings of Workshop on Combinatorial Mathematics and Computation Theory, pp. 266–272.
Ferone, D., Festa, P., and Resende, M.G.C. 2013. Hybrid metaheuristics for the far from most string problem. In Blesa, M.J., et al., eds. Proceedings of Hybrid Metaheuristics: Lecture Notes in Computer Science, pp. 174–188.
Festa, P. 2007. On some optimization problems in molecular biology. Math. Biosci. 207, 219–234.
Festa, P., and Pardalos, P.M. 2012. Efficient solution for the far from most string problem. Ann. Oper. Res. 196, 663–682.
Kotnyek, B. 2002. A generalization of totally unimodular and network matrices [PhD dissertation]. London School of Economics.
Kuksa, P.P., and Pavlovic, V. 2009. Efficient discovery of common patterns in sequences over large alphabets. DIMACS Technical Report 2009. www.dimacs.rutgers.edu/TechnicalReports/TechReports/2009/2009-15.pdf
Lanctot, J., Li, M., Ma, B., et al. 2003. Distinguishing string selection problems. Inf. Comput. 185, 41–55.
Meneses, C.N., Pardalos, P.M., Resende, M.G.C., and Vazacopoulos, A. 2005. Modeling and solving string selection problems. In Proceedings of the 2005 International Symposium on Mathematical and Computational Biology, Biomat 2005, Rio de Janeiro.
Pappalardo, E., Pardalos, P.M., and Stracquadanio, G. 2013. Optimization Approaches for Solving String Selection Problems. Springer, New York, NY.
Soleimani-damaneh, M. 2011. On some multiobjective optimization problems arising in biology. Int. J. Comput. Math. 88, 1103–1119.
Zörnig, P. 2011. Improved optimization modelling for the closest string and related problems. Appl. Math. Model. 35, 5609–5617.

Address correspondence to:
Dr. Peter Zörnig
Department of Statistics
Institute of Exact Sciences
University of Brasília
Asa Norte, Campus da UnB
70910-900 Brasília, DF
Brazil

E-mail: [email protected]
