Bio-Medical Materials and Engineering 24 (2014) 1307–1314 DOI 10.3233/BME-130933 IOS Press


A granular computing approach to gene selection

Lin Sun* and Jiucheng Xu

College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
Engineering Technology Research Center for Computing Intelligence and Data Mining, Henan Province, China

* Corresponding author. E-mail: [email protected].

Abstract. Gene selection is a key step in performing cancer classification with DNA microarrays, yet the high dimensionality and small sample size of microarray data sets remain challenging. Many algorithms based on rough set theory have been proposed for gene selection, but most are time-consuming. In this paper, a new granular computing-based gene selection method is proposed. First, some granular computing-based concepts are introduced and some of their important properties are derived, and the relationship between the positive region-based reduct and the granular space-based reduct is discussed. Then, a significance measure of features is proposed to improve the efficiency and decrease the complexity of the classical algorithm. Using hashtable and input-sequence techniques, a fast heuristic algorithm is constructed to improve the computational efficiency of gene selection for cancer classification. Extensive experiments are conducted on five public gene expression data sets and seven data sets from the UCI repository. The experimental results confirm the efficiency and effectiveness of the proposed algorithm.

Keywords: Feature selection, rough set theory, granular computing, granular space

1. Introduction

Distinguishing classes of cancers from gene expression levels is important for cancer diagnosis [1]. Gene expression data sets contain a large number of genes, but only a few of them are essential for classification. Approaches to extracting relevant genes have therefore become a key and active issue in cancer diagnosis in recent years [2]. Feature selection is a useful technique for dimensionality reduction [3]. Existing approaches to feature selection for gene expression data can generally be classified into wrapper and filter methods [4]. Here, the filter method is used, selecting features according to a significance measure. Identifying minimal gene subsets by discarding most of the noise and redundancy in a data set is a key challenge in gene expression profile-based cancer classification.

The ability to handle imprecise and inconsistent information in real-world problems has become one of the most important requirements for feature selection [5]. A technique that reduces dimensionality using only the information contained in the data set, while preserving the meaning of the features, is clearly desirable. Rough set theory has been successfully used in data mining tasks such as classification and feature selection [3-5]. Its main idea is to reduce the redundancy of data by feature selection while preserving the ability of classification. However, it is difficult to deal directly and effectively with the real-valued features of microarray data sets.





This highlights the importance of feature selection, with particular emphasis on microarray data. A gene expression data set can usually be represented as a decision system (table). In the last two decades, researchers have proposed some effective ways to apply rough sets to gene selection [6].

Rough set theory, as a useful feature selection method in pattern recognition, has been one of the most active areas popularizing granular computing [3-5]. The key concepts here are those of information granules and reducts [7]. Imprecision, uncertainty and partiality of truth are pervasive characteristics of the real world, which motivates the genesis of granular computing. Lin and Louie [8] presented fast algorithms using granular computing for finding association rules. In their presentation, however, the generation of different levels of association rules was not considered, and how to store bit maps was not made very clear either. Qiu et al. [9] proposed an approach to hierarchical concepts based on granular computing from an information system with inaccurate or uncertain values. Despite their own merits, these methods are still inefficient and thus unsuitable for the reduction of voluminous data. Although the heuristic approaches above can avoid the exponential computation of exhaustive methods, they still suffer from the intensive computation of either discernibility functions or partitions of the universe. Therefore, it is necessary to propose a heuristic feature selection algorithm with lower time complexity and higher classification performance.

Until now, few studies have been reported that construct a granular computing approach to gene selection. This paper focuses on creating such a solution. Its main objectives are to construct a granular computing-based granular space, present an effective significance measure of features to evaluate the dependency degree of knowledge, and improve the computational efficiency of a heuristic algorithm for gene selection.

2. Preliminaries

Formally, a decision system (DS) is a quadruple DS = (U, A = C ∪ D, V, f), where U is a finite nonempty set of objects, C is the finite set of condition features, D is the finite set of decision features with C ∩ D = ∅, V is the union of feature domains such that V = ∪_{a∈A} V_a, with V_a denoting the value domain of feature a, and f : U × A → V is an information function. For any P ⊆ A, there is an indiscernibility relation IND(P) = {(u, v) ∈ U × U | ∀a ∈ P, f(u, a) = f(v, a)}, which partitions U into equivalence classes given by U/IND(P) = {[u]_P | u ∈ U}; for simplicity, U/IND(P) is written U/P, where [u]_P denotes the equivalence class (block) determined by u with respect to P, i.e., [u]_P = {v ∈ U | (u, v) ∈ IND(P)}. Each [u]_P may be viewed as an information granule.

Let DS = (U, C, D) denote a decision system for short. For any P ⊆ C, the positive and negative regions are denoted by POS_P(D) = ∪{P̲X | X ∈ U/D} and NEG_P(D) = U − ∪{P̲X | X ∈ U/D}, where P̲X is the P-lower approximation of X. P is independent relative to D if POS_P(D) ≠ POS_{P−{a}}(D) for any a ∈ P ⊆ C; if, in addition, POS_P(D) = POS_C(D), then P is called a positive region-based reduct of C in the DS. Thus, the following theorem can be obtained.

Theorem 1. Let DS be a decision system with P ⊆ C. Then POS_C(D) = ∪{X | X ∈ U/C ∧ ∀x, y ∈ X, f(x, d) = f(y, d) for any d ∈ D} and POS_P(D) ⊆ POS_C(D).
Definition 1. Let DS be a decision system with U = {U_1, U_2, ..., U_m} and U/(C ∪ D) = {[U_1]_{C∪D}, [U_2]_{C∪D}, ..., [U_n]_{C∪D}}, where U_i ∈ U, i ≤ n. Then U' = {U_1, U_2, ..., U_n} and f' : U' × (C ∪ D) → V'. The resulting system is called a simplified decision system (SDS), denoted SDS = (U', C, D) for short.

By virtue of the above simplification, a great deal of redundant information is deleted and the space complexity of the DS is decreased; hence, the simplification is worthwhile. Note that the objects in U − POS_C(D) could be eliminated to further reduce the search space. However, the inconsistent objects play an important role in obtaining the reducts of decision systems. Therefore, the concepts of positive region and negative region object sets are introduced as follows.
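To make these constructions concrete, the following minimal Python sketch computes the partition U/P, the positive region POS_P(D), and the simplified universe U' of Definition 1 for a toy decision table. It is an illustration only, not the paper's implementation; all table values and helper names are ours.

    from collections import defaultdict

    def partition(U, rows, feats):
        """U/IND(P): group objects by their value tuple on feats."""
        blocks = defaultdict(list)
        for u in U:
            blocks[tuple(rows[u][a] for a in feats)].append(u)
        return list(blocks.values())

    def positive_region(U, rows, cond, dec):
        """POS_C(D): union of condition blocks whose objects share one decision."""
        pos = set()
        for block in partition(U, rows, cond):
            if len({rows[u][dec] for u in block}) == 1:
                pos |= set(block)
        return pos

    def simplify(U, rows, feats):
        """U' of Definition 1: one representative per (C ∪ D)-block."""
        return [block[0] for block in partition(U, rows, feats)]

    # Toy decision table: condition features a, b and decision d.
    rows = {0: {'a': 1, 'b': 0, 'd': 0}, 1: {'a': 1, 'b': 0, 'd': 1},
            2: {'a': 0, 'b': 1, 'd': 1}, 3: {'a': 0, 'b': 1, 'd': 1}}
    U = list(rows)
    print(positive_region(U, rows, ['a', 'b'], 'd'))  # {2, 3}: objects 0 and 1 conflict
    print(simplify(U, rows, ['a', 'b', 'd']))         # [0, 1, 2]: object 3 duplicates 2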



Definition 2. The D-positive region with respect to C in an SDS is defined as POS_C(D) = [U_{i1}]_C ∪ [U_{i2}]_C ∪ ... ∪ [U_{it}]_C, where U_{is} ∈ U', |[U_{is}]_C/D| = 1, 1 ≤ s ≤ t. Then U'_POS = {U_{i1}, U_{i2}, ..., U_{it}} is the positive region object set of the SDS, and U'_NEG = U' − U'_POS is the negative region object set of the SDS.

Theorem 2. Let SDS be a simplified decision system with P ⊆ C. Then U'_POS ⊆ POS_C(D) and POS_P(D) = ∪{X | X ∈ U'/P, X ⊆ U'_POS ∧ |X/D| = 1}.

3. Granular computing-based feature selection approach

3.1. Information granule and granular space

An information granule (IG) is denoted by the tuple IG = (ϕ, m(ϕ)), where ϕ refers to the intension of the IG and m(ϕ) represents its extension [9]. In an SDS, for any B = {a_1, a_2, ..., a_k} ⊆ C, let V_{a_i} = {V_{a_i,1}, V_{a_i,2}, ...} be the domain of feature a_i, and let each V_{a_i,j} be viewed as a concept. The intension of an IG can be denoted by ϕ = {I_1, I_2, ..., I_k}, and its extension by m(ϕ) = {u ∈ U' | f(u, a_1) = I_1 ∧ f(u, a_2) = I_2 ∧ ... ∧ f(u, a_k) = I_k, a_i ∈ B, i ∈ {1, 2, ..., k}}.

Definition 3. Let SDS be a simplified decision system with IG = (ϕ, m(ϕ)). If ϕ = {V_{a_i,j}}, then IG is called an elementary granule of a_i; namely, m(ϕ) = {u ∈ U' | f(u, a_i) = V_{a_i,j}, a_i ∈ C ∪ D}.

Definition 4. Let SDS be a simplified decision system. I = {I_1, I_2, ..., I_k} is called a k-itemset, where I_i ∈ V_{a_i} is a feature value of a_i ∈ B ⊆ C. Then IG = (I, m(I)) is a k-itemset granule, where m(I) = {u ∈ U' | f(u, a_1) = I_1 ∧ f(u, a_2) = I_2 ∧ ... ∧ f(u, a_k) = I_k, a_i ∈ B, i ∈ {1, 2, ..., k}}.

Definition 5. Let SDS be a simplified decision system. Its granular space (GS) is defined as a quadruple GS = (U', S = {(ϕ_n, m(ϕ_n)) | n ∈ I⁺}, {R_i | i ∈ I⁺}, T), where U' is the universe of the SDS, S denotes the set of involved information granules, R_i denotes the relations between information granules, I⁺ is the set of positive integers, and T = (ϕ_n, m(ϕ_n)) × R_i × (ϕ_n, m(ϕ_n)) is a function carrying out the junction between two concepts and binary relations. Note that the set of all granules constructed from the family of elementary granules is typically a subset of the power set of U'. It is also required that the union of all the elementary granules covers the universe U'. Different levels of relations among concepts then induce a hierarchical structure called a granular space. Furthermore, there exists a map between S in GS and U', denoted gs : S → U', such that gs(IG_i) = m(ϕ) for any IG_i in S.

Definition 6. Let SDS be a simplified decision system with granular space GS. The set of objects from POS_C(D) in the SDS is regarded as a granular space of positive region, denoted GS_P; the set of objects from NEG_C(D) in the SDS is regarded as a granular space of negative region, denoted GS_N.

From Definitions 2 and 6, the following theorems can be obtained immediately.

Theorem 3. Let SDS be a simplified decision system with granular space GS. If U'_GSP = ∪{gs(IG) | IG ∈ GS_P} and U'_GSN = ∪{gs(IG) | IG ∈ GS_N}, then U'_GSP = U'_POS and U'_GSN = U'_NEG.

Theorem 4. Let SDS be a simplified decision system with granular space GS. If U'_GSP = gs(S) for S in GS, or U'_GSN = ∅, then the SDS is consistent; otherwise it is inconsistent.

Theorem 5. Let SDS be a simplified decision system with granular space GS. The SDS is consistent iff for any IG ∈ S, x ∈ gs(IG) ∧ y ∈ gs(IG) ⇒ f(x, d) = f(y, d), where x, y ∈ U', d ∈ D.
Definition 7. Let SDS be a simplified decision system with granular space GS, P ⊆ C, and IG_i ∈ GS_P. If there exists IG_j ∈ GS_P with IG_i ≠ IG_j such that ϕ_i of IG_i and ϕ_j of IG_j have equal sets of condition feature values but different decision values, then IG_i is called a conflict granule of GS_P. If there exists IG_j ∈ GS_N with IG_i ≠ IG_j such that ϕ_i and ϕ_j have equal sets of condition feature values, then IG_i is likewise a conflict granule of GS_P; otherwise, it is a non-conflict granule of GS_P.
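Under the same toy conventions as the earlier sketch, the snippet below builds one granule per condition-value tuple (the tuple plays the role of the intension ϕ, the matching objects of the extension m(ϕ)) and splits the granules into GS_P and GS_N; a granule whose objects disagree on the decision ends up in GS_N, in line with the conflict granules of Definition 7. All names are illustrative.

    from collections import defaultdict

    def build_granular_space(rows, cond, dec):
        """One granule per condition-value tuple; a granule goes to GS_P when
        its objects agree on the decision, otherwise to GS_N."""
        extension = defaultdict(list)
        for u, r in rows.items():
            extension[tuple(r[a] for a in cond)].append(u)
        GSP, GSN = {}, {}
        for phi, objs in extension.items():
            target = GSP if len({rows[u][dec] for u in objs}) == 1 else GSN
            target[phi] = objs
        return GSP, GSN

    rows = {0: {'a': 1, 'b': 0, 'd': 0}, 1: {'a': 1, 'b': 0, 'd': 1},
            2: {'a': 0, 'b': 1, 'd': 1}}
    GSP, GSN = build_granular_space(rows, ['a', 'b'], 'd')
    print(GSP)  # {(0, 1): [2]}
    print(GSN)  # {(1, 0): [0, 1]} -- a conflict granule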



3.2. Significance measure of features and feature selection algorithm

Classical feature selection methods only distinguish one class from the remaining classes, but do not distinguish one class from another [10]. To address this issue, the following propositions are presented.

Definition 8. Let SDS be a simplified decision system with P ⊆ C. The D-positive region with respect to P is defined as POS_P(D) = ∪{X | X ∈ gs(S)/P ∧ X ⊆ U'_GSP}.

Lemma 1. For any P ⊆ C in an SDS, if POS_P(D) = U'_GSP, then POS_P(D) = POS_C(D).

Proof. Suppose P ⊆ C; by Theorem 1, POS_P(D) ⊆ POS_C(D). Suppose POS_P(D) ≠ POS_C(D); then there exists u_l ∈ POS_C(D) such that u_l ∉ POS_P(D). Then [u_l]_C ⊆ POS_C(D) and [u_l]_C ⊄ POS_P(D). It follows from Definition 6 that [u_l]_C ⊆ POS_C(D) implies there exists IG_i ∈ GS_P such that [gs(IG_i)]_C = [u_l]_C and gs(IG_i) ∈ U'_GSP. Thus gs(IG_i) ∈ POS_C(D) and gs(IG_i) ∉ POS_P(D). Since gs(IG_i) ∉ POS_P(D), there exists gs(IG_l) ∈ [gs(IG_i)]_P such that f(gs(IG_l), d) ≠ f(gs(IG_i), d), where d ∈ D. Hence {gs(IG_l), gs(IG_i)} ⊆ [gs(IG_i)]_P and |[gs(IG_i)]_P/D| ≠ 1. Since [gs(IG_i)]_P ∈ gs(S)/P = U'/P, it follows from Definition 8 that [gs(IG_i)]_P ⊄ POS_P(D), so gs(IG_i) ∉ POS_P(D). It is thus concluded from Theorem 3 that POS_P(D) ≠ U'_GSP, which contradicts the assumption POS_P(D) = U'_GSP. Therefore POS_P(D) = POS_C(D) holds.

Theorem 6. Let SDS be a simplified decision system with granular space GS and a ∈ P ⊆ C. P is a positive region-based reduct of C relative to D if POS_P(D) = U'_GSP and POS_P(D) ≠ POS_{P−{a}}(D).

Proof. Suppose a ∈ P; by Theorem 1, POS_{P−{a}}(D) ⊆ POS_P(D). Since POS_P(D) ≠ POS_{P−{a}}(D), there exists u_l ∈ POS_P(D) = U'_GSP such that u_l ∉ POS_{P−{a}}(D). Then [u_l]_{P−{a}} = X_i ∈ gs(S)/(P−{a}). It follows from Definitions 6 and 8 that [u_l]_{P−{a}} = X_i ⊄ U'_GSP. Hence there exists u_k ∈ X_i such that u_k ∈ gs(IG_t) ⊆ U'_GSN for some IG_t ∈ S in GS. By Theorem 4, there exist x, y ∈ [u_k]_C such that x ∈ gs(IG_t) ∧ y ∈ gs(IG_t) ∧ f(x, d) ≠ f(y, d), where d ∈ D. Then [u_k]_C ⊆ [gs(IG_t)]_C ⊆ U'_GSN and [u_k]_C ⊄ U'_GSP. It follows from Definition 6 and Theorem 3 that [u_k]_C ⊄ POS_C(D). Since P − {a} ⊆ P ⊆ C, POS_{P−{a}}(D) ⊆ POS_P(D) ⊆ POS_C(D), i.e., [u_k]_C ⊄ POS_{P−{a}}(D). Moreover [u_k]_{P−{a}} = [u_l]_{P−{a}}, since u_k ∈ X_i = [u_l]_{P−{a}}. Since P ⊆ C, P − {a} ⊆ C and [u_k]_C ⊆ [u_l]_{P−{a}}. Hence [u_l]_{P−{a}} ⊄ POS_{P−{a}}(D), so u_l ∉ POS_{P−{a}}(D). Since u_l ∈ U'_GSP, from Theorem 3 and Definition 2, u_l ∈ POS_C(D). Since POS_P(D) = U'_GSP, by Lemma 1, POS_P(D) = POS_C(D); then u_l ∈ POS_P(D). It follows that POS_{P−{a}}(D) ≠ POS_P(D), that is, every a ∈ P is necessary for D. Therefore, P is a positive region-based reduct of C relative to D.

Theorem 7. gs(IG)/(P ∪ {a}) = ∪{X/{a} | X ∈ gs(IG)/P} for any P ⊆ C, a ∈ C − P, IG ∈ S.

Definition 9. For any P ⊆ C in an SDS, with S in GS, the significance measure of a feature a ∈ C − P is defined as SIG_P(a) = |U_{P∪{a}} − U_P|, where

U_P = (∪{X | X ∈ gs(S)/P ∧ X ⊆ U'_GSP}) ∪ (∪{X | X ∈ gs(S)/P ∧ X ⊆ U'_GSN}).
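The sketch below illustrates Definition 9 on a toy table, reading it together with Theorem 2: a P-block counts as settled when it is decision-pure and lies inside the positive object set, or lies entirely inside the negative object set, and SIG_P(a) counts how many further objects become settled when a is added. The purity test on positive blocks is our interpretation, and all helper names are illustrative.

    from collections import defaultdict

    def blocks(objs, rows, feats):
        """Partition objs by their value tuples on feats."""
        out = defaultdict(list)
        for u in objs:
            out[tuple(rows[u][f] for f in feats)].append(u)
        return [set(b) for b in out.values()]

    def settled(rows, P, pos, neg, dec):
        """U_P: union of decision-pure P-blocks inside the positive object set,
        together with P-blocks inside the negative object set."""
        res = set()
        for X in blocks(rows.keys(), rows, P):
            pure = len({rows[u][dec] for u in X}) == 1
            if (X <= pos and pure) or X <= neg:
                res |= X
        return res

    def sig(rows, P, a, pos, neg, dec):
        """SIG_P(a) = |U_{P ∪ {a}} - U_P|."""
        return len(settled(rows, P + [a], pos, neg, dec)
                   - settled(rows, P, pos, neg, dec))

    rows = {0: {'a': 1, 'b': 0, 'd': 0}, 1: {'a': 1, 'b': 1, 'd': 0},
            2: {'a': 0, 'b': 0, 'd': 1}, 3: {'a': 0, 'b': 1, 'd': 1}}
    pos, neg = {0, 1, 2, 3}, set()            # all objects are consistent under C
    print(sig(rows, [], 'a', pos, neg, 'd'))  # 4: a alone settles every object
    print(sig(rows, [], 'b', pos, neg, 'd'))  # 0: b separates no classes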

Theorem 8. If U_P = gs(S) and U_{P−{a}} ⊂ gs(S) for any a ∈ P ⊆ C in an SDS, with S in GS, then P is a granular space-based reduct of C relative to D.

Proof. Suppose P ⊆ C; by Theorem 1, POS_P(D) ⊆ POS_C(D). Since a ∈ P and P − {a} ⊆ P, POS_{P−{a}}(D) ⊆ POS_P(D) similarly. Suppose gs(IG_i) ∈ POS_C(D) for any IG_i ∈ S; then [gs(IG_i)]_C ⊆ [gs(IG_i)]_P. Since U_P = gs(S), from Definition 9, [gs(IG_i)]_P ⊆ U'_GSP or [gs(IG_i)]_P ⊆ U'_GSN. If [gs(IG_i)]_P ⊆ U'_GSN, then [gs(IG_i)]_C ⊆ U'_GSN and [gs(IG_i)]_C ⊄ U'_GSP, so gs(IG_i) ∉ POS_C(D), which contradicts the fact that gs(IG_i) ∈ POS_C(D). Thus [gs(IG_i)]_P ⊆ U'_GSP. From Definition 8, gs(IG_i) ∈ POS_P(D) and POS_C(D) ⊆ POS_P(D); hence POS_P(D) = POS_C(D). Since U_{P−{a}} ⊂ gs(S), there exists IG_l ∈ S such that gs(IG_l) ∉ U_{P−{a}}. From Definition 9, gs(IG_l) ∉ ∪{X | X ∈ gs(S)/(P−{a}) ∧ X ⊆ U'_GSP} and gs(IG_l) ∉ ∪{X | X ∈ gs(S)/(P−{a}) ∧ X ⊆ U'_GSN}. There exists IG_j ∈ S such that gs(IG_l) ∈ [gs(IG_j)]_{P−{a}}. Since [gs(IG_j)]_{P−{a}} ⊄ U'_GSP, from Definition 8, [gs(IG_j)]_{P−{a}} ⊄ POS_{P−{a}}(D). Moreover gs(IG_l) ∈ [gs(IG_j)]_{P−{a}} ⊄ U'_GSN, so there exists IG_k ∈ S such that gs(IG_k) ∈ [gs(IG_j)]_{P−{a}} and gs(IG_k) ⊆ U'_GSP. Thus gs(IG_k) consists of positive region objects of the SDS, i.e., gs(IG_k) ∈ POS_C(D), and [gs(IG_k)]_C ⊆ U'_GSP. Since [gs(IG_k)]_{P−{a}} = [gs(IG_j)]_{P−{a}}, it follows that gs(IG_k) ∉ POS_{P−{a}}(D), so POS_C(D) ≠ POS_{P−{a}}(D), i.e., POS_P(D) ≠ POS_{P−{a}}(D). Thus every a ∈ P is necessary for D. Therefore, P is a granular space-based reduct of C relative to D.

Theorem 9. Let SDS be a simplified decision system with P ⊆ C and any a ∈ C − P. Then

SIG_P(a) = |U_{P∪{a}} − U_P| = |∪_{X ∈ (U'−U_P)/P} ((∪{Y | Y ∈ X/{a} ∧ Y ⊆ U'_GSP}) ∪ (∪{Y | Y ∈ X/{a} ∧ Y ⊆ U'_GSN}))|.

Proof. It follows immediately from Definition 9 that

U_{P∪{a}} = (∪{X | X ∈ gs(S)/(P∪{a}) ∧ X ⊆ U'_GSP}) ∪ (∪{X | X ∈ gs(S)/(P∪{a}) ∧ X ⊆ U'_GSN}).

By Theorem 7 (which applies equally to gs(S)), every block of gs(S)/(P∪{a}) has the form Y ∈ X/{a} for some X ∈ gs(S)/P, so

U_{P∪{a}} = ∪_{X ∈ gs(S)/P} ((∪{Y | Y ∈ X/{a} ∧ Y ⊆ U'_GSP}) ∪ (∪{Y | Y ∈ X/{a} ∧ Y ⊆ U'_GSN})).

Now split gs(S)/P into the blocks that already satisfy X ⊆ U'_GSP or X ⊆ U'_GSN and the remaining blocks, i.e., those with X ∈ (U' − U_P)/P. For a block of the first kind, every Y ∈ X/{a} satisfies Y ⊆ X ⊆ U'_GSP (respectively Y ⊆ X ⊆ U'_GSN), so these blocks contribute exactly U_P. Hence

U_{P∪{a}} = U_P ∪ (∪_{X ∈ (U'−U_P)/P} ((∪{Y | Y ∈ X/{a} ∧ Y ⊆ U'_GSP}) ∪ (∪{Y | Y ∈ X/{a} ∧ Y ⊆ U'_GSN}))),

where the second term is disjoint from U_P because its blocks are drawn from U' − U_P. Therefore

SIG_P(a) = |U_{P∪{a}} − U_P| = |∪_{X ∈ (U'−U_P)/P} ((∪{Y | Y ∈ X/{a} ∧ Y ⊆ U'_GSP}) ∪ (∪{Y | Y ∈ X/{a} ∧ Y ⊆ U'_GSN}))|.

The significance measure of a feature is usually defined in terms of the dependency degree. If an IG = (ϕ, m(ϕ)) is a non-conflict granule contained in GS_P, or a conflict granule with respect to C contained in GS_N, then SIG_P(a) = 0. Thus SIG_P(a) describes the dependency degree of a on P. In order to further reduce the complexity of feature selection, the following decomposed algorithms are introduced.

Algorithm 1. Calculate positive region and negative region object sets
Input: DS = (U, C, D), U = {u_1, u_2, ..., u_m}, P ⊆ C, D = {d}, and P = {a_1, a_2, ..., a_n}
Output: U'_POS, U'_NEG, and U/P
Initialize: U'_POS = U'_NEG = ∅, Count1 = Count2 = 0, and hashtable H, where h_i ∈ H, h_i.count = 0, h_i.cons = true
Step-1: for j = 1 to m
Step-2: { select u_j ∈ U, and let h_i = h_i ∪ hash(P(u_j)), where hash(P(u_j)) denotes a coded computing function of H and denotes the union of all f(u_j, a_k) for a_k ∈ P, 1 ≤ k ≤ n
Step-3: h_i.count = h_i.count + 1
Step-4: if f(h_i, d) ≠ f(u_j, d) then h_i.cons = false and Count1 = Count1 + 1
Step-5: if h_i.cons = true then Count2 = Count2 + h_i.count }
Step-6: for s = 1 to |Count2|
Step-7: { if h_i.cons = true then U'_POS = U'_POS ∪ h_i
Step-8: H = H − h_i }
Step-9: for s = 1 to |Count1|
Step-10: if h_i.cons = false then U'_NEG = U'_NEG ∪ h_i
Step-11: return U/P consisting of all h_i in H
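A single-pass Python sketch of Algorithm 1's hashtable idea, with a dict standing in for H: each object is bucketed by its encoded P-values, and a bucket's consistency flag is cleared as soon as two different decision values appear in it. Field names mirror the pseudocode but are otherwise illustrative.

    def hash_partition(U, rows, P, dec):
        """One pass over U: bucket objects by their P-value tuple (hash(P(u))),
        tracking each bucket's consistency flag (h.cons). Returns the positive
        and negative region object sets and the partition U/P as buckets."""
        H = {}
        for u in U:
            key = tuple(rows[u][f] for f in P)
            h = H.setdefault(key, {'objs': [], 'dec': rows[u][dec], 'cons': True})
            h['objs'].append(u)
            if rows[u][dec] != h['dec']:
                h['cons'] = False
        U_pos = {u for h in H.values() if h['cons'] for u in h['objs']}
        U_neg = {u for h in H.values() if not h['cons'] for u in h['objs']}
        return U_pos, U_neg, [h['objs'] for h in H.values()]

Since each object is hashed exactly once, the pass is O(|U|), matching the complexity claimed for Algorithm 1 below.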



Algorithm 2. Acquire granular space
Input: DS = (U, C, D), C = {c_1, c_2, ..., c_n}, and D = {d}
Output: S, GS_P, and GS_N
Initialize: S = GS_P = GS_N = ∅, and n = |C|
Step-1: calculate U/(C ∪ D) with Algorithm 1 to obtain U'
Step-2: for i = 1 to |U'/C|
Step-3: { calculate [U_i]_C/D with Algorithm 1, where [U_i]_C ∈ U'/C, and l = |[U_i]_C/D|
Step-4: for j = 1 to |[U_i]_C|
Step-5: { select u_j ∈ [U_i]_C, and calculate its corresponding IG = (ϕ, m(ϕ)), where ϕ = {I_1, I_2, ..., I_n}, m(ϕ) = {u ∈ U' | f(u, c_1) = I_1 ∧ f(u, c_2) = I_2 ∧ ... ∧ f(u, c_n) = I_n, c_t ∈ C, 1 ≤ t ≤ n}
Step-6: S = S ∪ {IG}
Step-7: if l = 1 then GS_P = GS_P ∪ {IG} else GS_N = GS_N ∪ {IG} } }

Algorithm 3. Calculate the significance measure of a feature
Input: SDS = (U', C, D), P ⊆ C, a ∈ C − P, and S, GS_P and GS_N calculated with Algorithm 2
Output: SIG_P(a)
Initialize: R = G_P = G_N = ∅
Step-1: calculate gs(S)/P, U'_GSP and U'_GSN from GS_P and GS_N for U_P, and (U' − U_P)/P with Algorithm 1
Step-2: for i = 1 to |(U' − U_P)/P|
Step-3: { select X_i ∈ (U' − U_P)/P, and calculate the partition X_i/{a} with Algorithm 1
Step-4: for j = 1 to |X_i/{a}|
Step-5: { select Y_j ∈ X_i/{a} to obtain R = R ∪ Y_j
Step-6: if Y_j ⊆ U'_GSP then G_P = G_P ∪ Y_j else if Y_j ⊆ U'_GSN then G_N = G_N ∪ Y_j } }
Step-7: (U' − U_P)/(P ∪ {a}) = R, and SIG_P(a) = |G_P ∪ G_N|

Algorithm 4. Granular computing-based feature selection algorithm (GCFSA)
Input: SDS = (U', C, D), D = {d}
Output: set P of selected features
Step-1: calculate S, GS_P and GS_N with Algorithm 2, and set P = ∅
Step-2: calculate U_P, (U' − U_P)/P, (U' − U_P)/(P ∪ {a}), G_P, and G_N with Algorithm 3
Step-3: calculate SIG_P(a) for each a ∈ C − P with Algorithm 3, construct an input sequence of features in descending order of SIG_P(a), select the front feature a in turn, and let P = P ∪ {a} and U' = U' − G_P − G_N
Step-4: if U' ≠ ∅ then go to Step-3
Step-5: s = |P|
Step-6: for i = 1 to s
Step-7: { select a_i ∈ P from the end to the beginning of P
Step-8: if |POS_C(D)| = |POS_{P−{a_i}}(D)|, i.e., SIG_{P−{a_i}}(a_i) = 0, then P = P − {a_i} }
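A compact Python sketch of GCFSA's control flow, reusing settled() and sig() from the earlier significance sketch: greedily add the feature with the largest significance, shrink the unresolved universe, then prune redundant features from the back. This is an illustrative skeleton of Steps 1-8, not the paper's optimized implementation; in particular, the input-sequence and incremental-partition tricks are omitted.

    def gcfsa(rows, C, dec, pos, neg):
        """Greedy skeleton of Algorithm 4: forward selection by SIG_P(a),
        then backward pruning of features whose removal keeps U_P unchanged."""
        P = []
        while set(rows) - settled(rows, P, pos, neg, dec):
            a = max((f for f in C if f not in P),
                    key=lambda f: sig(rows, P, f, pos, neg, dec))
            P.append(a)
        for a in reversed(P[:]):            # Steps 6-8: completeness check
            Q = [f for f in P if f != a]
            if settled(rows, Q, pos, neg, dec) == settled(rows, P, pos, neg, dec):
                P.remove(a)
        return P

With the toy table from the significance sketch, gcfsa(rows, ['a', 'b'], 'd', pos, neg) returns ['a']: feature a alone settles all four objects, so b is never added.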



The time complexity of Algorithm 1 with the hashtable is O(|U|), and that of Algorithm 2 is O(|U'/C||U'|), since |C| bounds the inner loop. With GCFSA, the time complexity is polynomial. At Step-1, the time complexity is O(|U| + |U'/C||U'|), and that of Step-2 is O(|C − P||U' − U_P|). Thus, the time complexity of Step-3 with an input sequence, together with Step-4, is O(|C||U'|) + O((|C| − 1)|U' − U_{P_1}|) + O((|C| − 2)|U' − U_{P_2}|) + ... + O(|C − P_k||U' − U_{P_k}|), where P_k is a reduct of the SDS. Furthermore, the aim of Steps 6 to 8 is to ensure the completeness of GCFSA, and their time complexity is O(|C||U'|). The total time complexity of GCFSA is therefore O(|U| + |U'/C||U'|) + O(|C − P||U' − U_P|) + O(|C||U'|) + O((|C| − 1)|U' − U_{P_1}|) + O((|C| − 2)|U' − U_{P_2}|) + ... + O(|C − P_k||U' − U_{P_k}|) + O(|C||U'|). Thus, the time complexity of GCFSA is close to O(|C|²|U'|), i.e., O(|C|²|U/C|). Furthermore, the space complexity of GCFSA is O(|C||U|).

4. Experimental results

The Rosetta and Weka software described in reference [7] is used in the experiments. Different algorithms available in Rosetta and Weka are utilized for discretizing the training and test data sets, for feature selection, and for data classification. In the first part of our experiments, the five well-known gene expression data sets shown in Table 1, the same data sets used in many publications on gene selection and cancer classification, are used to evaluate the performance of our approach. The decision rules extracted from the reduced training sets are employed as a classifier for the test sets, and 10-fold cross-validation is applied to estimate the forecast accuracy with respect to the reduct generated by GCFSA. The classification results are outlined in Table 1, from which it can be seen that the forecast accuracies on the five public microarray data sets are all above 91%, and the performances on Lung Cancer and MLL_Leukemia are clearly better than those on the others. Moreover, as the number of selected genes increases, higher forecast accuracies can be achieved on the test gene expression data sets.

Table 1
Comparison of features selected and classification results

Dataset                  Features  Samples  Features selected  Time (s)
Colon Tumor                  2000       62                  7
ALL-AML Leukemia             7069       72                  5
Central Nervous System       7129       60                  7
Lung Cancer                 12533      181                  7
MLL_Leukemia                12582       72                  6
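For readers who wish to reproduce the evaluation pipeline outside Rosetta and Weka, the sketch below runs 10-fold cross-validation over a selected gene subset using scikit-learn, with a decision tree standing in for the paper's rule-based classifier; the data matrix and the selected indices are synthetic placeholders, not the paper's results.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(62, 2000))   # shaped like Colon Tumor: 62 samples, 2000 genes
    y = rng.integers(0, 2, size=62)   # synthetic binary labels
    selected = [4, 17, 101, 512, 777, 901, 1500]   # hypothetical GCFSA output (7 genes)

    scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                             X[:, selected], y, cv=10)
    print(f"10-fold CV accuracy: {scores.mean():.3f}")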
