
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 6, NOVEMBER/DECEMBER 2013

Inapproximability of (1, 2)-Exemplar Distance

Laurent Bulteau and Minghui Jiang

Abstract: Given two genomes possibly with duplicate genes, the exemplar distance problem is that of removing all but one copy of each gene in each genome, so as to minimize the distance between the two reduced genomes according to some measure. Let $(s, t)$-EXEMPLAR DISTANCE denote the exemplar distance problem on two genomes $G_1$ and $G_2$, where each gene occurs at most $s$ times in $G_1$ and at most $t$ times in $G_2$. We show that the simplest nontrivial variant of the exemplar distance problem, (1, 2)-EXEMPLAR DISTANCE, is already hard to approximate for a wide variety of distance measures, including both popular genome rearrangement measures such as adjacency disruptions, signed reversals, and signed double-cut-and-joins, and classic string edit distance measures such as Levenshtein and Hamming distances.

Index Terms: Comparative genomics, hardness of approximation, adjacency disruption, sorting by reversals, double-cut-and-join, edit distance, Levenshtein distance, Hamming distance

1 INTRODUCTION

In the study of genome rearrangement, a gene is usually represented by a signed integer: the absolute value of the integer (the unsigned integer) denotes the gene family to which the gene belongs; the sign of the integer denotes the orientation of the gene in its chromosome. Then, a chromosome is a sequence of signed integers, and a genome is a collection of chromosomes. Given two genomes possibly with duplicate genes, the exemplar distance problem [15] is that of removing all but one copy of each gene in each genome, so as to minimize the distance between the two reduced genomes according to some measure. The reduced genomes are said to be exemplar subsequences of the original genomes. This approach amounts to considering that, in the evolutionary history, duplications have taken place after the speciation of the genomes (or more generally, that we are able to distinguish genes that have been duplicated before the speciation). Hence, in each genome, only one copy of each gene may be matched to an ortholog gene in the other genome. For example, the following two monochromosomal genomes:

$$G_1: \; -4 \;\; +1 \;\; +2 \;\; +3 \;\; -5 \;\; +1 \;\; +2 \;\; +3 \;\; -6$$
$$G_2: \; -1 \;\; -4 \;\; +1 \;\; +2 \;\; -5 \;\; +3 \;\; -2 \;\; -6 \;\; +3$$

can both be reduced to the same genome

$$G': \; -4 \;\; +1 \;\; +2 \;\; -5 \;\; +3 \;\; -6$$

by removing duplicates; thus, they have exemplar distance zero for any reasonable distance measure.

L. Bulteau is with the Laboratoire d'Informatique de Nantes-Atlantique (LINA), UMR CNRS 6241, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France. E-mail: [email protected]. M. Jiang is with the Department of Computer Science, Utah State University, 4205 Old Main Hill, Logan, UT 84322-4205. E-mail: [email protected]. Manuscript received 6 Aug. 2012; revised 25 Oct. 2012; accepted 1 Nov. 2012; published online 28 Nov. 2012. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBBSI-2012-08-0194. Digital Object Identifier no. 10.1109/TCBB.2012.144. 1545-5963/13/$31.00 © 2013 IEEE

In general,

unless we are to decide simply whether two genomes can be reduced to the same genome by removing duplicates, the exemplar distance problem is not a single problem but a group of related problems because the choice of the distance measure is not unique. We refer to Fig. 1 for an example scenario where the underlying distance measure is the signed reversal distance.

We denote by $(s, t)$-EXEMPLAR DISTANCE the exemplar distance problem on two genomes $G_1$ and $G_2$, where each gene occurs at most $s$ times in $G_1$ and at most $t$ times in $G_2$. It is known [5], [13] that for any reasonable distance measure, (2, 2)-EXEMPLAR DISTANCE does not admit any approximation. This is because merely deciding whether two genomes with maximum occurrence 2 can be reduced to the same genome by removing duplicates is already NP-hard. In this paper, we focus on the simplest nontrivial variant of the exemplar distance problem: (1, 2)-EXEMPLAR DISTANCE.

The problem $(1, t)$-EXEMPLAR DISTANCE has been studied for several distance measures commonly used in genome rearrangement. Angibaud et al. [2] showed that (1, 2)-EXEMPLAR BREAKPOINT DISTANCE, (1, 2)-EXEMPLAR COMMON INTERVAL DISTANCE, and (1, 2)-EXEMPLAR CONSERVED INTERVAL DISTANCE are all APX-hard. Blin et al. [4] showed that (1, 9)-EXEMPLAR MAD DISTANCE is NP-hard to approximate within $2 - \epsilon$ for any $\epsilon > 0$, and that $(1, \infty)$-EXEMPLAR SAD DISTANCE is NP-hard to approximate within $c \log n$ for some constant $c > 0$, where $n$ is the number of genes in $G_1$. See also [6], [8], [9] for related results.

The two distance measures we first consider, maximum adjacency disruption (MAD) and summed adjacency disruption (SAD), were introduced by Sankoff and Haque [16]. In any two genomes represented by two different permutations of the same set of genes, there exist pairs of genes that are adjacent in one genome but some distance apart in the other genome.
Intuitively, the MAD distance measures the maximum distance of such disruptions, and the SAD distance measures the total distance of such disruptions over all adjacencies. More formally, given two permutations $\pi' = \pi'_1 \ldots \pi'_n$ and $\pi'' = \pi''_1 \ldots \pi''_n$ of $n$ distinct elements, define $\sigma'(i)$ as the index $j$ such that $\pi'_j = \pi''_i$, and $\sigma''(i)$ as the index $j$ such that $\pi''_j = \pi'_i$. Then, the MAD and SAD distances between $\pi'$ and $\pi''$ are

$$\mathrm{MAD}(\pi', \pi'') = \max_{1 \le i \le n-1} \max\{\, |\sigma'(i) - \sigma'(i+1)|,\; |\sigma''(i) - \sigma''(i+1)| \,\},$$

$$\mathrm{SAD}(\pi', \pi'') = \sum_{1 \le i \le n-1} \bigl( |\sigma'(i) - \sigma'(i+1)| + |\sigma''(i) - \sigma''(i+1)| \bigr).$$

Fig. 1. During the evolution of two different species from a common ancestor, duplications occur in $G_2$, and reversals occur in both $G_1$ and $G_2$. By the parsimony principle, the exemplar distance of 3 between $G_1$ and $G_2$ corresponds to the number of reversal events in the most likely evolution history of the two species.
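The two definitions can be made concrete with a short sketch. The function names below are illustrative only (they do not come from the paper), and the code assumes both inputs are permutations of the same set of at least two distinct genes.

```python
def adjacency_disruptions(a, b):
    """List the disruption of every adjacency of a measured in b, and vice versa.

    For each pair of genes adjacent in one permutation, the disruption is the
    distance between their positions in the other permutation.
    """
    pos_a = {gene: i for i, gene in enumerate(a)}
    pos_b = {gene: i for i, gene in enumerate(b)}
    from_a = [abs(pos_b[s] - pos_b[t]) for s, t in zip(a, a[1:])]
    from_b = [abs(pos_a[s] - pos_a[t]) for s, t in zip(b, b[1:])]
    return from_a + from_b


def mad(a, b):
    """Maximum adjacency disruption between two permutations."""
    return max(adjacency_disruptions(a, b))


def sad(a, b):
    """Summed adjacency disruption between two permutations."""
    return sum(adjacency_disruptions(a, b))


if __name__ == "__main__":
    print(mad([1, 2, 3, 4], [2, 4, 1, 3]), sad([1, 2, 3, 4], [2, 4, 1, 3]))
```

Positions are 0-based here, which does not matter because only differences of positions enter the two formulas.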

We note that MAD and SAD distances are not distances in the strict mathematical sense because they are not zero for identical permutations: for a permutation $\pi$ of length $n$, it follows from the above definitions that $\mathrm{MAD}(\pi, \pi) = 1$ and $\mathrm{SAD}(\pi, \pi) = 2n - 2$. Our first two theorems sharpen the previous results on the inapproximability of $(1, t)$-EXEMPLAR DISTANCE for both MAD and SAD measures:

Theorem 1. (1, 2)-EXEMPLAR MAD DISTANCE is NP-hard to approximate within $2 - \epsilon$ for any $\epsilon > 0$.

Theorem 2. (1, 2)-EXEMPLAR SAD DISTANCE is NP-hard to approximate within $10\sqrt{5} - 21 - \epsilon = 1.3606\ldots - \epsilon$, and is NP-hard to approximate within $2 - \epsilon$ if the unique games conjecture is true, for any $\epsilon > 0$.

For an unsigned permutation $\pi = \pi_1 \ldots \pi_n$, an unsigned reversal $(i, j)$ with $1 \le i \le j \le n$ turns it into $\pi_1 \ldots \pi_{i-1}\, \pi_j \ldots \pi_i\, \pi_{j+1} \ldots \pi_n$, where the substring $\pi_i \ldots \pi_j$ is reversed. For a signed permutation $\sigma = \sigma_1 \ldots \sigma_n$, a signed reversal $(i, j)$ with $1 \le i \le j \le n$ turns it into $\sigma_1 \ldots \sigma_{i-1}\, {-\sigma_j} \ldots {-\sigma_i}\, \sigma_{j+1} \ldots \sigma_n$, where the substring $\sigma_i \ldots \sigma_j$ is reversed and negated (see Fig. 2a). The unsigned reversal distance (signed reversal distance, respectively) between two unsigned (signed, respectively) permutations is the minimum number of unsigned (signed, respectively) reversals required to transform one to the other. Computing the unsigned reversal distance is APX-hard [3], although the signed reversal distance can be computed in polynomial time [12]. Our next theorem answers an open question of Blin et al. [4] on the inapproximability of the exemplar reversal distance problem:

Theorem 3. (1, 2)-EXEMPLAR SIGNED REVERSAL DISTANCE is NP-hard to approximate within $1237/1236 - \epsilon$ for any $\epsilon > 0$.

The double-cut-and-join (DCJ) operation, introduced by Yancopoulos et al. [17], consists in cutting the permutation

in two positions, and joining the four ends in any new way. In practice, a DCJ operation can correspond to a reversal, to the excision of a substring into a circular permutation, or to the insertion of a circular permutation back into the main sequence, at any position (see Fig. 2). The problem of computing the DCJ distance between two permutations is known to be polynomial in the signed case [17], and is NP-hard in the unsigned case [7]. The following theorem shows the intractability of the exemplar DCJ problem:

Theorem 4. (1, 2)-EXEMPLAR SIGNED DCJ DISTANCE is NP-hard to approximate within $1237/1236 - \epsilon$ for any $\epsilon > 0$.

In the last theorem of this paper, we present the first inapproximability result on the exemplar distance problem using the classic string edit distance measure:

Theorem 5. (1, 2)-EXEMPLAR EDIT DISTANCE is APX-hard to compute when the cost of a substitution is 1 and the cost of an insertion or a deletion is at least 1.

Note that both Levenshtein distance and Hamming distance are special cases of the string edit distance: for Levenshtein distance, the cost of every operation (substitution, insertion, or deletion) is 1; for Hamming distance, the cost of a substitution is 1 and the cost of an insertion or a deletion is $+\infty$. Thus, we have the following corollaries:

Corollary 1. (1, 2)-EXEMPLAR LEVENSHTEIN DISTANCE is APX-hard.

Corollary 2. (1, 2)-EXEMPLAR HAMMING DISTANCE is APX-hard.

Our choices of the specific distance measures studied in this paper are based on two considerations. First, for a broader impact, we try to explore a wide variety of distance measures, which are suitable for different requirements of various biological applications, ranging from the ones measuring local differences such as Hamming and Levenshtein distances to those computing global rearrangement schemes such as reversal and DCJ distances. Second, in terms of computational complexity, the exemplar generalization of any measure for sequences with duplicates can only be harder to compute than the basic version of the same measure for sequences without duplicates. In order to obtain unambiguous results on the true difficulty of the exemplar distance problem, we restrict ourselves to measures whose basic versions are easy to compute. For example, given any two sequences, their Hamming distance can be trivially computed in linear time, and their Levenshtein distance can be computed in quadratic time by dynamic programming. Also, MAD and SAD distances between permutations admit straightforward polynomial-time algorithms following their definitions, and less straightforward but still polynomial-time algorithms exist for signed reversal distance [12] and signed DCJ distance [17].

Fig. 2. The possible operations allowed for the signed DCJ distance are (a) reversals, (b) excisions, and (c) insertions. We write circular permutations with parentheses, i.e., $(+1\; +2\; +3)$ is equal to $(+2\; +3\; +1)$ and to $(-3\; -2\; -1)$.
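The notion of an exemplar subsequence used throughout can be checked mechanically. The following is a minimal sketch with hypothetical helper names; the genome lists transcribe the two-genome example given earlier in this introduction, with a gene family taken to be the absolute value of a signed gene.

```python
def is_subsequence(sub, seq):
    """True if sub can be obtained from seq by deleting elements."""
    it = iter(seq)
    return all(gene in it for gene in sub)


def is_exemplar_subsequence(sub, genome):
    """True if sub is a subsequence of genome with exactly one gene per family."""
    families = [abs(gene) for gene in sub]
    return (
        is_subsequence(sub, genome)
        and len(set(families)) == len(families)            # no family kept twice
        and set(families) == {abs(gene) for gene in genome}  # no family dropped
    )


# The example genomes from the introduction and their common reduction.
G1 = [-4, +1, +2, +3, -5, +1, +2, +3, -6]
G2 = [-1, -4, +1, +2, -5, +3, -2, -6, +3]
G_PRIME = [-4, +1, +2, -5, +3, -6]

if __name__ == "__main__":
    print(is_exemplar_subsequence(G_PRIME, G1), is_exemplar_subsequence(G_PRIME, G2))
```

Since both genomes reduce to the same sequence, their exemplar distance is zero under any measure that assigns distance zero to identical genomes.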

2 MAD DISTANCE

In this section, we prove Theorem 1. We show that EXEMPLAR MAD DISTANCE is NP-hard to approximate by a reduction from the well-known NP-hard problem 3SAT [11]. Let $(V, C)$ be a 3SAT instance, where $V = \{v_1, \ldots, v_n\}$ is a set of $n$ Boolean variables, $C = \{c_1, \ldots, c_m\}$ is a conjunctive Boolean formula of $m$ clauses, and each clause in $C$ is a disjunction of exactly three literals of the variables in $V$. The problem 3SAT is that of deciding whether $(V, C)$ is satisfiable, i.e., whether there is a truth assignment for the variables in $V$ that satisfies all clauses in $C$.

Let $M = \Theta((m + n)/\epsilon)$ be a large number to be specified. We will construct two sequences (genomes) $G_1$ and $G_2$ over

$$L = 3m + (n + 1) + (2n + 1) + (m + 1) + (2M + 2) = 2M + 3n + 4m + 5$$

distinct genes:

• three literal genes $r_j, s_j, t_j$ for the three literals of each clause $c_j$, $1 \le j \le m$;
• $n + 1$ variable genes $x_i$, $0 \le i \le n$;
• $2n + 1$ separator genes $y_i$, $0 \le i \le 2n$;
• $m + 1$ clause genes $z_j$, $0 \le j \le m$;
• $2M + 2$ dummy genes $\phi_k$ and $\psi_k$, $0 \le k \le M$.

For each clause $c_j$, let $O_j = r_j s_j t_j$ be the concatenation of the three literal genes of $c_j$. For each variable $v_i$, let $P_i = p_{i,1} \ldots p_{i,k_i}$ be the concatenation of the $k_i$ literal genes of the positive literals of $v_i$, and let $Q_i = q_{i,1} \ldots q_{i,l_i}$ be the concatenation of the $l_i$ literal genes of the negative literals of $v_i$. Without loss of generality, assume that $\min\{k_i, l_i\} \ge 1$. Note that the two concatenated sequences $O_1 \ldots O_m$ and $P_1 Q_1 \ldots P_n Q_n$ are both permutations of the $3m$ literal genes.

The two sequences $G_1$ and $G_2$ are represented schematically as follows: $G_1$ contains exactly one copy of each gene, and has length $L$; $G_2$ contains exactly two copies of each literal gene and exactly one copy of each nonliteral gene, and has length $L + 3m$.


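$$G_1: \; \ldots z_3\, z_1 \;\; \phi_0 \;\; \ldots x_2\, x_0 \;\; \phi_M \ldots \phi_1 \;\; y_0\, P_1\, y_1\, Q_1\, y_2 \,\ldots\, P_n\, y_{2n-1}\, Q_n\, y_{2n} \;\; \psi_1 \ldots \psi_M \;\; z_0\, z_2 \ldots \;\; \psi_0 \;\; x_1\, x_3 \ldots$$

$$G_2: \; x_n\, P_n\, Q_n \,\ldots\, x_1\, P_1\, Q_1\, x_0 \;\; \phi_M \ldots \phi_1\, \phi_0 \;\; y_0\, y_1\, y_2 \ldots y_{2n-1}\, y_{2n} \;\; \psi_0\, \psi_1 \ldots \psi_M \;\; z_0\, O_1\, z_1 \,\ldots\, O_m\, z_m.$$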
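The gene counts stated for the construction can be checked on a small instance. The sketch below is only an illustration under stated assumptions: the tuple encoding of genes, the names `phi` and `psi` for the two families of dummy genes, and the exact order chosen for the elided parts of $G_1$ (odd clause genes descending, even variable genes descending, even clause genes ascending, odd variable genes ascending) are assumptions. Clauses are given as triples of nonzero integers, where $i$ stands for $v_i$ and $-i$ for its negation.

```python
def build_mad_instance(n, clauses, M):
    """Build (G1, G2) for a 3SAT instance with n variables (assumed layout)."""
    m = len(clauses)
    P = {i: [] for i in range(1, n + 1)}  # literal genes of positive literals
    Q = {i: [] for i in range(1, n + 1)}  # literal genes of negative literals
    O = []                                # literal genes grouped by clause
    for j, clause in enumerate(clauses, start=1):
        genes = [("lit", j, t) for t in range(3)]
        O.append(genes)
        for gene, literal in zip(genes, clause):
            (P if literal > 0 else Q)[abs(literal)].append(gene)

    x = [("x", i) for i in range(n + 1)]
    y = [("y", i) for i in range(2 * n + 1)]
    z = [("z", j) for j in range(m + 1)]
    phi = [("phi", k) for k in range(M + 1)]
    psi = [("psi", k) for k in range(M + 1)]

    # y0 P1 y1 Q1 y2 ... Pn y(2n-1) Qn y(2n)
    middle = [y[0]]
    for i in range(1, n + 1):
        middle += P[i] + [y[2 * i - 1]] + Q[i] + [y[2 * i]]

    G1 = (z[1::2][::-1] + [phi[0]] + x[0::2][::-1] + phi[:0:-1]
          + middle + psi[1:] + z[0::2] + [psi[0]] + x[1::2])

    G2 = []
    for i in range(n, 0, -1):
        G2 += [x[i]] + P[i] + Q[i]
    G2 += [x[0]] + phi[::-1] + y + psi + [z[0]]
    for j in range(1, m + 1):
        G2 += O[j - 1] + [z[j]]
    return G1, G2


if __name__ == "__main__":
    G1, G2 = build_mad_instance(3, [(1, 2, 3), (-1, -2, -3)], M=10)
    print(len(G1), len(G2))
```

On this instance ($n = 3$, $m = 2$, $M = 10$) the lengths should agree with $L = 2M + 3n + 4m + 5$ and $L + 3m$.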

Lemma 1. If $(V, C)$ is satisfiable, then $G_2$ has an exemplar subsequence $G_2'$ that satisfies $\mathrm{MAD}(G_1, G_2') \le M + 3n + 4m + 5$.

Proof. Let $f$ be a truth assignment for the variables in $V$ that satisfies all clauses in $C$. For each variable $v_i$, compose a subsequence $V_i$ of $P_i Q_i$ such that $V_i = Q_i$ if $f(v_i)$ is true and $V_i = P_i$ if $f(v_i)$ is false. For each clause $c_j$, compose a subsequence $C_j$ of $O_j$ containing only the literal genes of the literals that are true under the assignment $f$. Then, $V_1 \ldots V_n C_1 \ldots C_m$ is a permutation of the $3m$ literal genes. Moreover, none of the $n + m$ subsequences $V_i$ and $C_j$ is empty.

$$G_1: \; \ldots z_3\, z_1 \;\; \phi_0 \;\; \ldots x_2\, x_0 \;\; \phi_M \ldots \phi_1 \;\; y_0\, P_1\, y_1\, Q_1\, y_2 \,\ldots\, P_n\, y_{2n-1}\, Q_n\, y_{2n} \;\; \psi_1 \ldots \psi_M \;\; z_0\, z_2 \ldots \;\; \psi_0 \;\; x_1\, x_3 \ldots$$

$$G_2: \; x_n\, P_n\, Q_n \,\ldots\, x_1\, P_1\, Q_1\, x_0 \;\; \phi_M \ldots \phi_1\, \phi_0 \;\; y_0\, y_1\, y_2 \ldots y_{2n-1}\, y_{2n} \;\; \psi_0\, \psi_1 \ldots \psi_M \;\; z_0\, O_1\, z_1 \,\ldots\, O_m\, z_m$$

$$G_2': \; x_n\, V_n \,\ldots\, x_1\, V_1\, x_0 \;\; \phi_M \ldots \phi_1\, \phi_0 \;\; y_0\, y_1\, y_2 \ldots y_{2n-1}\, y_{2n} \;\; \psi_0\, \psi_1 \ldots \psi_M \;\; z_0\, C_1\, z_1 \,\ldots\, C_m\, z_m.$$

It is straightforward to verify that, between $G_1$ and the exemplar subsequence $G_2'$ of $G_2$ shown above, any two adjacent genes in one sequence can be nonadjacent in the other sequence only if they are both after $\phi_M \ldots \phi_1$ or both before $\psi_1 \ldots \psi_M$ in the latter sequence. This implies that $\mathrm{MAD}(G_1, G_2') \le L - M = M + 3n + 4m + 5$. □

Lemma 2. If $(V, C)$ is not satisfiable, then every exemplar subsequence $G_2'$ of $G_2$ satisfies $\mathrm{MAD}(G_1, G_2') > 2M$.

Proof. We prove the contrapositive. Suppose $G_2$ has an exemplar subsequence $G_2'$ that satisfies $\mathrm{MAD}(G_1, G_2') \le 2M$. We will find a truth assignment $f$ for the variables in $V$ that satisfies all clauses in $C$.

First, we claim that for each variable $v_i$, the literal genes of the positive literals of $v_i$ must appear in $G_2'$ either all before $\phi_M$ or all after $\psi_M$. Suppose the contrary. Then, there would be two literal genes of $v_i$, one before $\phi_M$ and one after $\psi_M$ in $G_2'$, that are adjacent in the substring $P_i$ in $G_1$, incurring a MAD distance larger than $2M$. Similarly, we claim that the literal genes of the negative literals of each variable $v_i$ must appear in $G_2'$ either all before $\phi_M$ or all after $\psi_M$.

Next, we claim that for each variable $v_i$, the literal genes of either all positive literals of $v_i$ or all negative literals of $v_i$ must appear in $G_2'$ before $\phi_M$, between $x_i$ and $x_{i-1}$. Suppose the contrary that all literal genes of both the positive and the negative literals of $v_i$ appear in $G_2'$ after $\psi_M$. Then, the two variable genes $x_i$ and $x_{i-1}$, one before $\phi_M$ and one after $\psi_M$ in $G_1$, would become adjacent in $G_2'$, incurring a MAD distance larger than $2M$.

Finally, we claim that for each clause $c_j$, at least one of the three literal genes $r_j, s_j, t_j$ must appear in $G_2'$ after $\psi_M$, between $z_{j-1}$ and $z_j$. Suppose the contrary. Then, the two clause genes $z_{j-1}$ and $z_j$, one before $\phi_M$ and one after $\psi_M$ in $G_1$, would become adjacent in $G_2'$, again incurring a MAD distance larger than $2M$.

Now compose a truth assignment $f$ for the variables in $V$ such that $f(v_i)$ is true if the literal genes for the negative literals of $v_i$ appear before $\phi_M$ in $G_2'$, and is false otherwise. Then, for each clause $c_j$, the literal genes $r_j, s_j, t_j$ that appear after $\psi_M$ in $G_2'$ must correspond to true literals. Since at least one of the three literal genes of each clause appears after $\psi_M$ in $G_2'$, $f$ satisfies all clauses in $C$. □

For any constant $\epsilon$, $0 < \epsilon < 2$, we can get a gap of $2M/(M + 3n + 4m + 5) = 2 - \epsilon$ by setting $M = (2/\epsilon - 1)(3n + 4m + 5)$. Thus, the NP-hardness of 3SAT and the two preceding lemmas together imply that EXEMPLAR MAD DISTANCE is NP-hard to approximate within $2 - \epsilon$ for any $\epsilon > 0$.

3 SAD DISTANCE

In this section, we prove Theorem 2. We show that EXEMPLAR SAD DISTANCE is NP-hard to approximate by a reduction from another well-known NP-hard problem MINIMUM VERTEX COVER [11]. Let $G = (V, E)$ be a graph, where $V = \{v_1, \ldots, v_n\}$ is a set of $n$ vertices, and $E = \{e_1, \ldots, e_m\}$ is a set of $m$ edges. The problem MINIMUM VERTEX COVER is that of finding a subset $C \subseteq V$ of the minimum cardinality such that each edge in $E$ is incident to at least one vertex in $C$.

Let $M = 2(n + m)^2$. We will construct two sequences (genomes) $G_1$ and $G_2$ over $L = n + m + M + 1$ distinct genes:

• $n$ vertex genes $v_i$, $1 \le i \le n$;
• $m$ edge genes $e_j$, $1 \le j \le m$;
• $M + 1$ dummy genes $\phi_k$, $0 \le k \le M$.

For each vertex $v_i$, let $E_i = e_{i,1} \ldots e_{i,k_i}$ be the concatenation of the edge genes of all edges incident to $v_i$, where $k_i$ is the degree of $v_i$. The two sequences $G_1$ and $G_2$ are represented schematically as follows: $G_1$ contains exactly one copy of each gene, and has length $L$; $G_2$ contains exactly two copies of each edge gene and exactly one copy of each nonedge gene, and has length $L + m$.

$$G_1: \; e_1 \ldots e_m \;\; \phi_0\, \phi_1 \ldots \phi_M \;\; v_1 \ldots v_n$$

$$G_2: \; \phi_0\, \phi_1 \ldots \phi_M \;\; E_1\, v_1 \,\ldots\, E_n\, v_n.$$

Lemma 3. $G$ has a vertex cover of size at most $k$ if and only if $G_2$ has an exemplar subsequence $G_2'$ that satisfies $\mathrm{SAD}(G_1, G_2') \le (2k + 4)M$.

Proof. We first prove the direct implication. Let $C$ be a vertex cover of size at most $k$ in $G$. Extract a subsequence $E_i'$ of $E_i$ for each vertex $v_i$ in $C$ such that the concatenated sequence $E_1' \ldots E_n'$ contains each edge gene $e_j$ exactly once. From $G_2$, remove $E_i$ for each vertex $v_i$ not in $C$, and replace $E_i$ by $E_i'$ for each vertex $v_i$ in $C$. Then, we obtain an exemplar subsequence $G_2'$ of $G_2$.

The two sequences $G_1$ and $G_2'$ have the same length $L = n + m + M + 1$ and together have $2n + 2m + 2M$ adjacencies. The contributions of these adjacencies to $\mathrm{SAD}(G_1, G_2')$ are as follows:

1. The shared adjacencies $\phi_i \phi_{i+1}$ in $G_1$ and $G_2'$, $0 \le i \le M - 1$, contribute a total value of exactly $2M$.
2. The adjacency $e_m \phi_0$ in $G_1$ contributes a value of at least $M$ and at most $M + n + m$.
3. Each adjacency between an edge gene and a nonedge gene in $G_2'$ contributes a value of at least $M$ and at most $M + n + m$.
4. Each remaining adjacency contributes a value of at least 1 and at most $n + m$.

The number of adjacencies between an edge gene and a nonedge gene in $G_2'$ is exactly twice the size of the vertex cover $C$. Thus, we have

$$\mathrm{SAD}(G_1, G_2') \le 2M + (2k + 1)(M + n + m) + (2n + 2m + 2M - 2M - 2k - 1)(n + m) = (2k + 3)M + 2(n + m)^2 = (2k + 4)M.$$

We next prove the reverse implication. Let $G_2'$ be an exemplar subsequence of $G_2$ such that $\mathrm{SAD}(G_1, G_2') \le (2k + 4)M$. Refer back to the list of contributions to $\mathrm{SAD}(G_1, G_2')$. Let $l$ be the number of adjacencies between an edge gene and a nonedge gene in $G_2'$. Then, we have the following inequality:

$$\mathrm{SAD}(G_1, G_2') \ge 2M + (l + 1)M = (l + 3)M.$$

Since $\mathrm{SAD}(G_1, G_2') \le (2k + 4)M$, we have $l + 3 \le 2k + 4$ and hence $l \le 2k + 1$. Note that $l$ must be an even number: for each adjacency between an edge gene in $E_i$ and a nonedge gene to its left, there must be another adjacency between an edge gene in $E_i$ and a nonedge gene (indeed a vertex gene) to its right, and vice versa. It follows that $l \le 2k$, and there are at most $k$ vertex genes $v_i$ that are adjacent to an edge gene to their left. The corresponding at most $k$ vertices $v_i$ form a vertex cover of $G$. □

Dinur and Safra [10] showed that MINIMUM VERTEX COVER is NP-hard to approximate within any constant less than $10\sqrt{5} - 21 = 1.3606\ldots$. Khot and Regev [14] showed that MINIMUM VERTEX COVER is NP-hard to approximate within any constant less than 2 if the unique games conjecture is true. The inapproximability of MINIMUM VERTEX COVER and the preceding lemma together imply that EXEMPLAR SAD DISTANCE is NP-hard to approximate within $10\sqrt{5} - 21 - \epsilon$, and is NP-hard to approximate within $2 - \epsilon$ if the unique games conjecture is true, for any $\epsilon > 0$.
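The direct implication of Lemma 3 can be illustrated on a small graph. The following sketch builds $G_1$ and $G_2$ from a graph, derives an exemplar subsequence from a vertex cover, and evaluates the SAD distance. The tuple encoding of genes, the label `phi` for the dummy genes, the 0-based vertex numbering, and the rule that assigns each edge to its first covering endpoint are illustrative assumptions, not part of the paper.

```python
def sad(a, b):
    """Summed adjacency disruption between two permutations of the same genes."""
    pos_a = {gene: i for i, gene in enumerate(a)}
    pos_b = {gene: i for i, gene in enumerate(b)}
    return (sum(abs(pos_b[s] - pos_b[t]) for s, t in zip(a, a[1:]))
            + sum(abs(pos_a[s] - pos_a[t]) for s, t in zip(b, b[1:])))


def build_sad_instance(n, edges):
    """Return (G1, G2, M) for a graph on vertices 0..n-1."""
    m = len(edges)
    M = 2 * (n + m) ** 2
    dummies = [("phi", k) for k in range(M + 1)]
    G1 = [("e", j) for j in range(m)] + dummies + [("v", i) for i in range(n)]
    G2 = list(dummies)
    for i in range(n):
        # E_i: edge genes of all edges incident to vertex i, then the vertex gene
        G2 += [("e", j) for j, edge in enumerate(edges) if i in edge] + [("v", i)]
    return G1, G2, M


def exemplar_from_cover(n, edges, cover, M):
    """Keep each edge gene once, next to one endpoint that lies in the cover."""
    kept = {i: [] for i in range(n)}
    for j, (a, b) in enumerate(edges):
        owner = a if a in cover else b  # a vertex cover contains a or b
        kept[owner].append(("e", j))
    exemplar = [("phi", k) for k in range(M + 1)]
    for i in range(n):
        exemplar += kept[i] + [("v", i)]
    return exemplar


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    G1, G2, M = build_sad_instance(3, triangle)
    G2x = exemplar_from_cover(3, triangle, {0, 1}, M)
    print(sad(G1, G2x), (2 * 2 + 4) * M)
```

For the triangle with the cover $\{0, 1\}$ (so $k = 2$), the computed distance should not exceed $(2k + 4)M$, in line with the direct implication.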

4 SIGNED REVERSAL AND DCJ DISTANCES

In this section, we prove Theorems 3 and 4. We first show that (1, 2)-EXEMPLAR SIGNED REVERSAL DISTANCE is APX-hard by a reduction from the problem MIN-SBR [3], which asks for the minimum number of unsigned reversals to sort a given unsigned permutation into the identity permutation. Let $\pi = \pi_1 \ldots \pi_n$ be an unsigned permutation of $1 \ldots n$. We construct two sequences $G_1 = +1 \ldots +n$ and $G_2 = +\pi_1\, {-\pi_1} \ldots +\pi_n\, {-\pi_n}$.

Lemma 4. $\pi$ can be sorted into the identity permutation $1 \ldots n$ by at most $k$ unsigned reversals if and only if $G_2$ has an exemplar subsequence $G_2'$ with signed reversal distance at most $k$ from $G_1$.

Proof. We say that a signed permutation $\sigma$ is a signed version of $\pi$ if for all $1 \le i \le n$, $\pi_i = |\sigma_i|$. The lemma is based on two key observations. First, the permutation $\pi$ can be sorted in $k$ reversals if and only if there exists a signed version $\sigma$ of $\pi$ that can be sorted in $k$ (signed) reversals. Second, a signed permutation $\sigma$ is an exemplar subsequence of $G_2$ if and only if it is a signed version of $\pi$.

The first observation is a classical result: given a sequence of reversals sorting $\pi$, construct $\sigma$ by applying the same sequence in reversed order from the signed identity permutation. Conversely, any sequence of signed reversals sorting a signed version of $\pi$, seen as a sequence of unsigned reversals, transforms $\pi$ into the identity.

The second observation is obtained by construction of $G_2$: any signed version of $\pi$ can be seen as an exemplar subsequence of $G_2$, and all exemplar subsequences of $G_2$ are signed versions of $\pi$.

The lemma is directly deduced from these two equivalences:

$\pi$ can be sorted by at most $k$ unsigned reversals
$\Leftrightarrow$ $\pi$ has a signed version $\sigma$ that can be sorted by at most $k$ signed reversals
$\Leftrightarrow$ $G_2$ has an exemplar subsequence $G_2' = \sigma$ with signed reversal distance at most $k$ from $G_1$. □
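For very small $n$, the statement of Lemma 4 can be checked exhaustively. The sketch below is a brute-force illustration (exponential in $n$, and in no way the polynomial-time algorithm of [12]); all function names are hypothetical. It computes reversal distances by breadth-first search from the identity, which is valid because every reversal is its own inverse, and enumerates the exemplar subsequences of $G_2$ as the $2^n$ sign choices.

```python
from collections import deque
from itertools import permutations, product


def reversals(seq, signed):
    """Yield every sequence reachable from seq by one reversal (i, j), i <= j."""
    n = len(seq)
    for i in range(n):
        for j in range(i, n):
            middle = seq[i:j + 1][::-1]
            if signed:
                middle = tuple(-gene for gene in middle)
            yield seq[:i] + middle + seq[j + 1:]


def distances_to_identity(n, signed):
    """Breadth-first search from the identity over all reachable sequences."""
    start = tuple(range(1, n + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in reversals(current, signed):
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def exemplar_subsequences(pi):
    """All exemplar subsequences of G2 = +pi_1 -pi_1 ... +pi_n -pi_n."""
    return [tuple(s * g for s, g in zip(signs, pi))
            for signs in product((1, -1), repeat=len(pi))]


def check_lemma4(n):
    """Compare the unsigned distance with the best exemplar signed distance."""
    unsigned = distances_to_identity(n, signed=False)
    signed = distances_to_identity(n, signed=True)
    return all(
        unsigned[pi] == min(signed[s] for s in exemplar_subsequences(pi))
        for pi in permutations(range(1, n + 1))
    )


if __name__ == "__main__":
    print(check_lemma4(3))
```

The check compares exact optima, which is equivalent to the "at most $k$" formulation of the lemma for every $k$.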

Since MIN-SBR is NP-hard to approximate within $1237/1236 - \epsilon$ for any $\epsilon > 0$ [3], (1, 2)-EXEMPLAR SIGNED REVERSAL DISTANCE is NP-hard to approximate within $1237/1236 - \epsilon$ for any $\epsilon > 0$ too.

We now prove Theorem 4 by a reduction from SORTING BY UNSIGNED DCJ [7]. Given an unsigned permutation $\pi$, compose the same sequences $G_1$ and $G_2$ as before: $G_1 = +1 \ldots +n$ and $G_2 = +\pi_1\, {-\pi_1} \ldots +\pi_n\, {-\pi_n}$. We have the following lemma:

Lemma 5. $\pi$ can be sorted into the identity permutation $1 \ldots n$ by at most $k$ unsigned DCJs if and only if $G_2$ has an exemplar subsequence $G_2'$ with signed DCJ distance at most $k$ from $G_1$.

Proof. As in the proof of Lemma 4, this result is obtained from the following two equivalences:

$\pi$ can be sorted by at most $k$ unsigned DCJs
$\Leftrightarrow$ $\pi$ has a signed version $\sigma$ that can be sorted by at most $k$ signed DCJs
$\Leftrightarrow$ $G_2$ has an exemplar subsequence $G_2' = \sigma$ with signed DCJ distance at most $k$ from $G_1$. □

The problem SORTING BY UNSIGNED DCJ has been proved to be NP-hard [7]. We note that it is in fact NP-hard to approximate within $1237/1236 - \epsilon$ for any $\epsilon > 0$ because, according to [7, Theorem 2], SORTING BY UNSIGNED DCJ has the same objective function as BREAKPOINT GRAPH DECOMPOSITION (formulated as a minimization problem), and the latter is known to be NP-hard to approximate within $1237/1236 - \epsilon$ for any $\epsilon > 0$ [3, Theorem 4]. It follows that (1, 2)-EXEMPLAR SIGNED DCJ DISTANCE is also NP-hard to approximate within $1237/1236 - \epsilon$ for any $\epsilon > 0$.

5 EDIT DISTANCE

In this section, we prove Theorem 5. For any edit distance where the cost of a substitution is 1 and the cost of an insertion or a deletion is at least 1 (possibly $+\infty$), we show that the problem (1, 2)-EXEMPLAR EDIT DISTANCE is APX-hard by a reduction from the problem MINIMUM VERTEX COVER IN CUBIC GRAPHS.

Let $G = (V, E)$ be a cubic graph of $n$ vertices and $m$ edges, where $3n = 2m$. We will construct two sequences (genomes) $G_1$ and $G_2$ over an alphabet of

$$3m + 4n + 2(m + 7n) + 2(m - 1) + (n - 1)$$

distinct genes. For each edge $e = \{u, v\} \in E$, we have three edge genes $e$, $e_u$, and $e_v$. For each vertex $v \in V$, we have a vertex gene $v$ and three dummy genes $v'_1$, $v'_2$, $v'_3$. In addition, we have $2(m + 7n) + 2(m - 1) + (n - 1)$ genes for separators. The construction is illustrated in Fig. 3 for the complete graph $K_4$.

The two sequences $G_1$ and $G_2$ are composed from $m + n + 1$ gadgets: an edge gadget for each edge, a vertex gadget for each vertex, and a tail gadget. The $m + n + 1$ gadgets are separated by $m + n$ separators of total length $2(m + 7n) + 2(m - 1) + (n - 1)$:

• two long separators, each of length $m + 7n$: one between the last edge gadget and the first vertex gadget, one between the last vertex gadget and the tail gadget;
• $m + n - 2$ short separators: a length-2 separator between any two consecutive edge gadgets, and a length-1 separator between any two consecutive vertex gadgets.

For each edge $e = \{u, v\}$, the edge gadget for $e$ is

$$G_1\langle e \rangle = e, \qquad G_2\langle e \rangle = e_u\, e_v.$$

For each vertex $v$ incident to edges $e, f, g$, the vertex gadget for $v$ is

$$G_1\langle v \rangle = v\, v'_1\, v'_2\, v'_3, \qquad G_2\langle v \rangle = e_v\, f_v\, g_v\, v\, e\, f\, g.$$

Let $V'$ be the $3n$ genes $v'_1$, $v'_2$, $v'_3$ for $v \in V$. Let $E'$ be the $2m = 3n$ genes $e_u$ and $e_v$ for $e = \{u, v\} \in E$. The tail gadget is

$$G_1\langle \mathrm{tail} \rangle = E', \qquad G_2\langle \mathrm{tail} \rangle = V'.$$

This completes the construction.

Lemma 6. $G$ has a vertex cover of size at most $k$ if and only if $G_2$ has an exemplar subsequence $G_2'$ with edit distance at most $m + 6n + k$ from $G_1$.


Fig. 3. Example for the reduction from MINIMUM VERTEX COVER to (1, 2)-EXEMPLAR EDIT DISTANCE. Above: a cubic graph $G$ with an optimal vertex cover $\{s, t, v\}$ and the corresponding independent set $\{u\}$. Below: the sequences $G_1$ and $G_2$ created from $G$; we use a common symbol $S$ for all separators. An optimal exemplarization of $G_2$ is underlined, and matched elements in this exemplarization are in bold font.

Proof. We first prove the direct implication. Let $X$ be a vertex cover of $G$ with $|X| \le k$. Create $G_2'$ as follows: for each edge $e = \{u, v\}$, at least one vertex, say $u$, is in $X$. Remove $e_u$ and retain $e_v$ in the edge gadget $G_2\langle e \rangle$, and correspondingly retain $e_u$ in the vertex gadget $G_2\langle u \rangle$ and remove $e_v$ in the vertex gadget $G_2\langle v \rangle$, then remove $e$ in $G_2\langle u \rangle$ and retain $e$ in $G_2\langle v \rangle$.

We claim that the edit distance from $G_1$ to $G_2'$ is at most $m + 6n + k$. It suffices to show that the Hamming distance of $G_1$ and $G_2'$ is at most $m + 6n + k$ since, for the edit distance that we consider, the cost of a substitution is 1. Observe that in both $G_1$ and $G_2'$, each edge gadget has length 1, and each vertex gadget has length 4. Thus, all gadgets are aligned and all separators are matched. The Hamming distance for each edge gadget is 1, so the total Hamming distance over all edge gadgets is $m$. The Hamming distance for each vertex gadget is at most 4. Moreover, for each vertex $v \notin X$ ($v$ incident to edges $e, f, g$), since the genes $e_v, f_v, g_v$ are removed (and the genes $e, f, g$ are retained) in the vertex gadget, the gene $v$ is matched, which reduces the Hamming distance by 1. Thus, the total Hamming distance over all vertex gadgets is at most $4n - (n - |X|) = 3n + |X|$. Finally, since the Hamming distance for the tail gadget is $3n$, the overall Hamming distance of $G_1$ and $G_2'$ is at most $m + 6n + |X| \le m + 6n + k$.

We next prove the reverse implication. Let $G_2'$ be an exemplar subsequence of $G_2$ with edit distance at most $m + 6n + k$ from $G_1$. Compute an alignment of $G_1$ and $G_2'$ corresponding to the edit distance, then obtain the following three sets $X_E(G_2')$, $X_V(G_2')$, and $X(G_2')$:

• The set $X_E(G_2') \subseteq E$ contains every edge $e = \{u, v\}$ such that either $G_2'\langle e \rangle$ contains both $e_u$ and $e_v$, or $G_1\langle e \rangle$ has an adjacent separator gene which is unmatched.
• The set $X_V(G_2') \subseteq V$ contains every vertex $v$ ($v$ incident to edges $e, f, g$) such that either $G_2'\langle v \rangle$ contains one of $\{e_v, f_v, g_v\}$, or $G_1\langle v \rangle$ has an adjacent separator gene (to its left) which is unmatched.
• The set $X(G_2') \subseteq V$ is the union of $X_V(G_2')$ and a set composed by arbitrarily choosing one vertex from each edge in $X_E(G_2')$ (thus $|X(G_2')| \le |X_V(G_2')| + |X_E(G_2')|$).

We first show that the edit distance from $G_1$ to $G_2'$ is at least $m + 6n + |X(G_2')|$. If a long separator (with $m + 7n$ genes) is completely unmatched, then the edit distance is at least $m + 7n \ge m + 6n + |X(G_2')|$. Hence, we can assume that there is at least one matched gene in each long separator. Consequently, the genes $e, e_u, e_v$ for all $e \in E$ and $v'_1, v'_2, v'_3$ for all $v \in V$ are unmatched.

Consider an edge $e = \{u, v\} \in E$. If $e \notin X_E(G_2')$, then the edit distance for $G_1\langle e \rangle$ is at least 1 since the gene $e$ is unmatched. If $e \in X_E(G_2')$, then consider the substring of $G_1$ containing the gene $e$ and the at most two separator genes adjacent to it (for the first edge gadget, there is only one separator gene adjacent to $e$, to its right). The edit distance for this substring is at least 2: the gene $e$ is unmatched, and moreover either an adjacent separator gene is unmatched or an insertion is required. The total edit distance over all edge gadgets is at least $m + |X_E(G_2')|$.

Consider a vertex $v \in V$ incident to three edges $e, f, g$. If $v \notin X_V(G_2')$, then the edit distance for $G_1\langle v \rangle$ is at least 3 since the genes $v'_1, v'_2, v'_3$ are unmatched. If $v \in X_V(G_2')$, then consider the substring of $G_1$ containing $G_1\langle v \rangle$ and the separator to its left. The edit distance for this substring is at least 4: the genes $v'_1, v'_2, v'_3$ are unmatched, and moreover at least one insertion is required unless either the gene $v$ or the separator gene to its left is unmatched. The total edit distance over the vertex gadgets is at least $3n + |X_V(G_2')|$.

Finally, the edit distance over the tail gadget is at least the length of $G_1\langle \mathrm{tail} \rangle$, which is $3n$. Hence, the overall edit distance is at least

$$m + |X_E(G_2')| + 3n + |X_V(G_2')| + 3n \ge m + 6n + |X(G_2')|.$$

Since the edit distance from $G_1$ to $G_2'$ is at most $m + 6n + k$, it follows that $|X(G_2')| \le k$. To complete the proof, we show that $X(G_2')$ is a vertex cover of $G$.
Consider any edge $e = \{u, v\}$. If $e \in X_E(G_2')$, then, by our choice of $X(G_2')$, either $u \in X(G_2')$ or $v \in X(G_2')$. Otherwise, if $e \notin X_E(G_2')$, then in the edge gadget $G_2\langle e \rangle = e_u\, e_v$, at least one gene is removed to obtain $G_2'\langle e \rangle$. Assume that $e_u$ is removed; then the second copy, in $G_2\langle u \rangle$, is retained, and $u \in X_V(G_2') \subseteq X(G_2')$. Likewise, if $e_v$ is removed, then $v \in X(G_2')$. In summary, $X(G_2')$ contains a vertex from every edge in $E$; hence, it is a vertex cover of $G$. □

The problem MINIMUM VERTEX COVER IN CUBIC GRAPHS is APX-hard (see, e.g., [1]). For a cubic graph $G$ of $n$ vertices and $m$ edges, where $3n = 2m$, the minimum size $k$ of a vertex cover is $\Theta(m + n)$. By Lemma 6, the exemplar edit distance of the two sequences $G_1$ and $G_2$ in the reduced instance is also $\Theta(m + n)$. Thus, by the standard technique of L-reduction, it follows that (1, 2)-EXEMPLAR EDIT DISTANCE, when the cost of a substitution is 1 and the cost of an insertion or a deletion is at least 1, is APX-hard too. Then, the APX-hardness of (1, 2)-EXEMPLAR LEVENSHTEIN DISTANCE and the APX-hardness of (1, 2)-EXEMPLAR HAMMING DISTANCE follow as special cases.

Moreover, since the lengths of the two sequences $G_1$ and $G_2$ in the reduced instance are both $\Theta(m + n)$ as well, it follows that the complementary maximization problem (1, 2)-EXEMPLAR HAMMING SIMILARITY is also APX-hard, if we define the Hamming similarity of two sequences of the same length $\ell$ as $\ell$ minus their Hamming distance.
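The direct implication of Lemma 6 can be replayed on the complete graph $K_4$ of Fig. 3. The sketch below builds $G_1$ together with the exemplar subsequence of $G_2$ chosen in the proof, gadget by gadget, and measures their Hamming distance. The tuple encoding of genes, the order of genes inside the tail gadget, and the rule picking the first covering endpoint of each edge are assumptions made for illustration.

```python
from itertools import combinations, count


def build_exemplar_pair(vertices, edges, cover):
    """Return (G1, G2x), where G2x is the exemplar of G2 induced by the cover."""
    n, m = len(vertices), len(edges)
    fresh = count()
    G1, G2x = [], []

    def emit(part1, part2):
        G1.extend(part1)
        G2x.extend(part2)

    def emit_separator(length):
        sep = [("sep", next(fresh)) for _ in range(length)]
        emit(sep, sep)  # separators are identical in both sequences

    # For every edge pick a covering endpoint (the proof's "say u").
    chosen = {e: (e[0] if e[0] in cover else e[1]) for e in edges}

    for idx, e in enumerate(edges):
        if idx:
            emit_separator(2)
        other = e[1] if chosen[e] == e[0] else e[0]
        emit([("edge", e)], [("half", e, other)])  # e_u removed, e_v retained
    emit_separator(m + 7 * n)
    for idx, v in enumerate(vertices):
        if idx:
            emit_separator(1)
        incident = [e for e in edges if v in e]
        halves = [("half", e, v) for e in incident if chosen[e] == v]
        wholes = [("edge", e) for e in incident if chosen[e] != v]
        emit([("v", v)] + [("dummy", v, t) for t in range(3)],
             halves + [("v", v)] + wholes)
    emit_separator(m + 7 * n)
    emit([("half", e, w) for e in edges for w in e],                 # E'
         [("dummy", v, t) for v in vertices for t in range(3)])      # V'
    return G1, G2x


def hamming(a, b):
    """Hamming distance of two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(s != t for s, t in zip(a, b))


K4_VERTICES = "stuv"
K4_EDGES = list(combinations(K4_VERTICES, 2))

if __name__ == "__main__":
    G1, G2x = build_exemplar_pair(K4_VERTICES, K4_EDGES, cover={"s", "t", "v"})
    print(hamming(G1, G2x))
```

With $n = 4$, $m = 6$, and the cover $\{s, t, v\}$ of size $k = 3$, the distance should not exceed $m + 6n + k = 33$.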

6 CONCLUDING REMARKS

We find it most intriguing that, although the problem $(1, 2)$-EXEMPLAR DISTANCE has been shown to be APX-hard for a wide variety of distance measures, including breakpoints, conserved intervals, common intervals, MAD, SAD, signed reversals, DCJs, Levenshtein distance, and Hamming distance, no constant approximation is known for any of these measures. On the other hand, it seems difficult to strengthen the constant lower bound in any of these APX-hardness results into a lower bound that grows with the input size, similar to the logarithmic lower bound for MINIMUM SET COVER.

REFERENCES

[1] P. Alimonti and V. Kann, “Some APX-Completeness Results for Cubic Graphs,” Theoretical Computer Science, vol. 237, pp. 123-134, 2000.
[2] S. Angibaud, G. Fertin, I. Rusu, A. Thévenin, and S. Vialette, “On the Approximability of Comparing Genomes with Duplicates,” J. Graph Algorithms and Applications, vol. 13, pp. 19-53, 2009.
[3] P. Berman and K. Karpinski, “On some Tighter Inapproximability Results,” Proc. 26th Int’l Colloquium Automata, Languages and Programming, pp. 200-209, 1999.
[4] G. Blin, C. Chauve, G. Fertin, R. Rizzi, and S. Vialette, “Comparing Genomes with Duplications: A Computational Complexity Point of View,” Proc. IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 4, pp. 523-534, 2007.
[5] G. Blin, G. Fertin, F. Sikora, and S. Vialette, “The Exemplar Breakpoint Distance for Non-Trivial Genomes Cannot Be Approximated,” Proc. Third Workshop Algorithms and Computation (WALCOM ’09), pp. 357-368, 2009.
[6] P. Bonizzoni, G.D. Vedova, R. Dondi, G. Fertin, R. Rizzi, and S. Vialette, “Exemplar Longest Common Subsequence,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 4, no. 4, pp. 535-543, Oct. 2007.
[7] X. Chen, “On Sorting Unsigned Permutations by Double-Cut-and-Joins,” J. Combinatorial Optimization, pp. 1-13, 2010.
[8] Z. Chen, R.H. Fowler, B. Fu, and B. Zhu, “On the Inapproximability of the Exemplar Conserved Interval Distance Problem of Genomes,” J. Combinatorial Optimization, vol. 15, pp. 201-221, 2008.
[9] Z. Chen, B. Fu, and B. Zhu, “The Approximability of the Exemplar Breakpoint Distance Problem,” Proc. Second Int’l Conf. Algorithmic Aspects in Information and Management (AAIM ’06), pp. 291-302, 2006.
[10] I. Dinur and S. Safra, “On the Hardness of Approximating Minimum Vertex Cover,” Annals of Math., vol. 162, pp. 439-485, 2005.
[11] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, 1979.
[12] S. Hannenhalli and P. Pevzner, “Transforming Cabbage into Turnip: Polynomial Algorithm for Sorting Signed Permutations by Reversals,” J. ACM, vol. 46, pp. 1-27, 1999.
[13] M. Jiang, “The Zero Exemplar Distance Problem,” J. Computational Biology, vol. 18, pp. 1077-1086, 2011.
[14] S. Khot and O. Regev, “Vertex Cover Might Be Hard to Approximate to within 2 − ε,” J. Computer and System Sciences, vol. 74, pp. 335-349, 2008.
[15] D. Sankoff, “Genome Rearrangement with Gene Families,” Bioinformatics, vol. 15, pp. 909-917, 1999.
[16] D. Sankoff and L. Haque, “Power Boosts for Cluster Tests,” Proc. RECOMB Int’l Workshop Comparative Genomics (RCG ’05), pp. 121-130, 2005.
[17] S. Yancopoulos, O. Attie, and R. Friedberg, “Efficient Sorting of Genomic Permutations by Translocation, Inversion and Block Interchange,” Bioinformatics, vol. 21, no. 16, pp. 3340-3346, 2005.

Laurent Bulteau received the MS degree in computer science at École Normale Supérieure, Paris, in 2009, and he is currently working toward the PhD degree at the University of Nantes. His research interests include computational complexity analysis and design of algorithms for problems issued from bioinformatics.

Minghui Jiang received the BS degree in physics from Peking University in 1997, the MS degree in physics and the MS degree in computer science from Purdue University in 1999, and the PhD degree in computer science from Montana State University in 2005. He is an associate professor of computer science at Utah State University. His research interests span many areas of theoretical computer science broadly connected to the design and analysis of algorithms, such as discrete and computational geometry, combinatorial optimization, and bioinformatics algorithms.

