Accepted Manuscript

FIFS: A data mining method for informative marker selection in high dimensional population genomic data

Ioannis Kavakiotis, Patroklos Samaras, Alexandros Triantafyllidis, Ioannis Vlahavas

PII: S0010-4825(17)30318-9
DOI: 10.1016/j.compbiomed.2017.09.020
Reference: CBM 2791
To appear in: Computers in Biology and Medicine
Received Date: 2 August 2017
Revised Date: 29 August 2017
Accepted Date: 26 September 2017

Please cite this article as: I. Kavakiotis, P. Samaras, A. Triantafyllidis, I. Vlahavas, FIFS: A data mining method for informative marker selection in high dimensional population genomic data, Computers in Biology and Medicine (2017), doi: 10.1016/j.compbiomed.2017.09.020.
FIFS: A Data Mining Method for Informative Marker Selection in High Dimensional Population Genomic Data
Ioannis Kavakiotis (a,b), Patroklos Samaras (a), Alexandros Triantafyllidis (b), Ioannis Vlahavas (a)

(a) School of Informatics, Aristotle University of Thessaloniki, 54124, Greece
(b) Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, 54124, Greece

Correspondence: Ioannis Kavakiotis
Email: [email protected]
Address: School of Informatics, Aristotle University of Thessaloniki, 54124, Greece
Tel: +30-2310-0998145
Fax: +30-231-0998362
Abstract
Background and Objective

Single Nucleotide Polymorphisms (SNPs) are nowadays becoming the marker of choice for biological analyses involving a wide range of applications with great medical, biological, economic and environmental interest. Classification tasks, i.e. the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields such as forensic investigations, discrimination between wild and/or farmed populations, and others. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification tasks, especially for Single Nucleotide Polymorphism (SNP) datasets, where the number of features can amount to hundreds of thousands or millions.

Methods

In this paper, we present a novel data mining approach, called FIFS (Frequent Item Feature Selection), based on the use of frequent items for the selection of the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them, in order to create the informative SNP subsets to be returned.

Results

The proposed method (FIFS) was tested on a real dataset providing comprehensive coverage of pig breed types present in Britain. This dataset consisted of 446 individuals divided into 14 sub-populations, genotyped at 59,436 SNPs. Our method outperforms the state-of-the-art and baseline methods in every case. More specifically, our method surpassed the assignment accuracy threshold of 95% needing only half the number of SNPs selected by other methods (FIFS: 28 SNPs; Delta: 70 SNPs; Pairwise FST: 70 SNPs; In: 100 SNPs).
Conclusion

Our approach successfully deals with the problem of informative marker selection in high dimensional genomic datasets. It offers better results compared to existing approaches and can aid biologists in selecting the most informative markers with maximum discrimination power for the optimization of cost-effective panels, with applications related to e.g. species identification, wildlife management and forensics.
Keywords

Bioinformatics, Machine Learning, Data Mining, Feature Selection, Frequent Pattern Mining, Single Nucleotide Polymorphism, Population Genomics, Ancestry Informative Marker, Big Data

1. Introduction
Significant advances in biotechnology, and more specifically in high-throughput technologies, have made big data production feasible even for small laboratories, and have enabled researchers to become big data users by accessing public repositories such as EBI [1] or the international HapMap consortium [2]. Terabyte-sized datasets are now common in sciences including biology [3], and new services should be developed to facilitate their utilization. It is evident that machine learning, data mining and knowledge discovery in biological data will play an important role in upcoming scientific discoveries.

Data mining and machine learning constitute computer science subfields strongly related to artificial intelligence, statistics, mathematics and database technology. These subfields offer sophisticated techniques, producing robust and accurate results when applied to a wide range of sciences. Nowadays, the continuous progress of high-throughput sequencing technologies has enabled the genotyping of thousands or even millions of SNPs (Single Nucleotide Polymorphisms) and consequently the production of very large, high-dimensional SNP datasets for model as well as non-model organisms.
SNPs are rapidly becoming the marker of choice for a wide range of organisms and biological analyses [4], and SNP data analysis is gaining popularity in various applications with great medical, biological and economic interest, such as the identification of disease-related mutations, Quantitative Trait Locus (QTL) analyses, food traceability, brand authentication, discrimination between wild and/or farmed populations and anthropological forensic investigations [5].
Several of the above-mentioned applications require the assignment of individuals, or groups of individuals, to their population of origin, based on their (multi-locus) genotypes [6]. This is a classification task, performed with specialized classifiers for genomic data. As has been reported many times [5, 7, 8], it is important to create minimal panels (i.e. datasets) with maximum discrimination power, i.e. to reduce dataset dimensionality. To achieve this, researchers should select, amongst all genotyped loci (SNPs), those SNPs that can best discriminate the analyzed population samples. In terms of machine learning and data mining this is a feature selection task, while for population genetics it is an informative marker selection task.
Informative marker selection is a fairly classic task in population genetics, and in previous decades it was mainly performed with microsatellites. The production of SNP datasets with incomparably higher dimensionality has revealed many drawbacks of the existing approaches, as well as new opportunities, to be discussed later. Still, no current method can be broadly accepted as the most successful, because none outperforms the others in all circumstances.
The present work proposes a novel data mining approach, called FIFS (Frequent Item Feature Selection), based on the notion of frequent items for selecting the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them, in order to create the informative SNP subsets to be returned.
FIFS significantly improves on the performance of existing methods, achieving better assignment accuracy in every comparison. Moreover, its modular nature facilitates the reuse of many intermediate results, making the whole approach extremely efficient. FIFS is implemented in Java and can therefore be executed on all operating systems. The application and the user manual are available on the application's website: http://intelligence.csd.auth.gr/bioinformatics/fifs/
This paper is organized as follows. The motivation behind this work is described in section 2. Section 3 presents the biological background knowledge concerning SNPs and the respective datasets. Section 4 presents related work and existing approaches for the informative marker selection task. In section 5, all necessary terms, equations and methods used in our experimental setup are presented. Section 6 is dedicated to the detailed description of our approach. Later on, the experimental process is described in section 7. The results of the experiments conducted to compare our new method with the state-of-the-art and baseline methods are presented in section 8, and the paper is concluded in section 9.
2. Motivation

The importance of feature selection for SNP datasets is beyond dispute, in both biology and machine learning. From a biological point of view, the importance of feature selection, i.e. selecting those SNPs with the maximum information power, has been stressed in several scientific projects and papers [5, 7, 8] and is principally economic. Although valuable, full genome-wide data are costly to produce. Smaller panels (i.e. datasets) are faster, cheaper and more flexible, and thus facilitate the genotyping of several hundreds of individuals at an incomparably lower cost than genome-wide genotyping.

From a data mining perspective, feature selection has multiple beneficial effects on prediction algorithms. Firstly, it can improve prediction performance by defying the curse of dimensionality. Moreover, it improves computational performance and facilitates data understanding through easier data visualization [9]. In the case of large SNP datasets, a feature selection process is therefore essential.

However, the identification of these (minimal) SNP sets has lately become problematic, due to the dimensionality of SNP datasets. The problems are mainly associated with computer science issues, and more specifically with the handling, processing and analysis of big data. Existing prominent applications have various computational limitations: all of them have been used to analyze microsatellite datasets, of incomparably lower dimensionality, containing up to 100 features/loci. SNP datasets, on the other hand, can contain hundreds of thousands of loci in animal/plant organisms, or even millions in human datasets. The analysis of these datasets on a desktop computer may be prohibitively expensive or even impossible. All related problems and issues are discussed later on and are also thoroughly reviewed in [4].

Furthermore, considering that SNP selection is a task to be performed mainly by biologists, possessing few or no programming skills, it is crucial that proposed methods are not computationally intensive and can be run, if possible, on a desktop computer.
3. Background Knowledge
3.1 Single Nucleotide Polymorphisms – SNPs

A single nucleotide polymorphism is a type of genetic variation that occurs when a single nucleotide (A, T, G or C) in the genome differs between members of a biological species or between paired chromosomes. For example, figure 1 depicts three genotyped DNA sequences, which differ in one nucleotide at the sixth and eleventh positions. Such mutations can, in some cases, be harmful and cause various diseases (such as cancer or diabetes), or be harmless, especially when they are located in non-coding regions, i.e. not in genes. The best-known SNP is the common mutation causing sickle cell anaemia.
Fig. 1. Three sequences containing two SNPs. Each SNP presents two alleles.
SNPs can also affect how organisms respond to drugs, vaccines and pathogens. An allele is one of the possible alternative forms of the same gene or genetic locus. In SNP studies, the existence of two alleles is the most common case. A marker possessing only two alleles is called biallelic.

3.2 Population Genomic Datasets – SNP Datasets
SNP datasets can be found in many file formats; PED files [10], HapMap files [2], Variant Call Format (VCF) and GENEPOP [11] are some of the most commonly used. The dimensionality of SNP datasets can vary a lot: animal/plant datasets can usually reach a hundred thousand attributes (SNPs), whereas human datasets can exceed a million SNPs. In SNP datasets, each attribute is a biallelic marker, i.e. it corresponds to the variation between two nucleotides. Since most plant and animal organisms are diploid, receiving one set of chromosomes from a male and one from a female parent, the genotype of one individual (i.e. the combination of the two parental alleles) can have at most three values, arising from the combination of two nucleotides: two homozygous and one heterozygous. For instance, one SNP can present the values AA, GG and AG (GA and AG are essentially the same). The term allele frequency refers to the frequency of each allele at a specific SNP in a population. For instance, consider a population containing five individuals. Assume a SNP with two alleles (T and C), corresponding to three possible genotypic combinations (TT, CC, TC). If three individuals have TT genotypes, one CC and one TC, then the allele frequencies are 0.7 for the T allele and 0.3 for the C allele.
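The worked example above can be reproduced with a few lines of code (an illustrative Python sketch, not part of the FIFS implementation, which is in Java): each diploid individual contributes two chromosomes, so allele frequencies are counted over 2N chromosomes.

```python
# Allele frequencies at one biallelic SNP, from the genotypes of N diploid
# individuals (each genotype string contributes two alleles).
from collections import Counter

def allele_frequencies(genotypes):
    """Return the frequency of each allele over the 2N sampled chromosomes."""
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# The five-individual example from the text: three TT, one CC, one TC.
freqs = allele_frequencies(["TT", "TT", "TT", "CC", "TC"])
print(freqs)  # {'T': 0.7, 'C': 0.3}
```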
4. Related Work
In this section, we present the necessary background knowledge for the feature selection task and link it to the available population genetics methods for selecting informative markers.

4.1 Feature Selection
Feature selection methods are divided into two major categories. The first category comprises "filter" methods, which evaluate attributes based on general data characteristics. The second category contains "wrapper" methods, which use machine learning algorithms to evaluate and finally decide which is the most appropriate candidate subset of features [12]. The main advantage of wrappers is that they commonly (though not in population genetics, as stated later) offer better classification accuracy. Their main disadvantage is that the algorithm has to build a model many times in order to evaluate different subsets of features, which in some cases, such as with Support Vector Machines (SVMs), is computationally expensive. This is an important drawback especially for SNP datasets, whose dimensionality can be extremely high. On the other hand, the main advantage of filters is that they are much faster than wrapper methods, and probably the only available solution when dealing with big data.

In the case of wrappers, the key task of the whole selection process, which determines the computational performance of the whole method, is the strategy for searching the attribute space in order to select the subset with the best classification accuracy. The number of possible attribute subsets increases exponentially with the number of attributes: for an exhaustive search of the feature space, up to 2^m possible subsets have to be examined, where m is the number of attributes. Thus, it is obvious that exhaustive search can only be used on datasets containing few features.
Due to the computational costs of exhaustive searches, greedy approaches are often adopted to search the attribute space. The two standard strategies are opposite in direction but operate in a completely analogous way: at each step, the attribute subset changes by adding or deleting a single attribute. More specifically, the first approach, called forward selection, begins with an empty attribute subset. The aim, at each step, is to select the best attribute to add to the existing subset. To achieve this, every remaining attribute is added to the subset, one at a time, and the performance of the resulting subset is computed. When all attributes have been examined, the attribute whose addition gave the best performance is added to the existing subset.

The second approach, called backward elimination, starts with the whole set of attributes and removes one attribute at each step, in a way analogous to forward selection: the attribute whose removal still yields the greatest performance is excluded from the subset. These two approaches guarantee a locally, but not necessarily globally, optimal set of attributes [12]. Although they are much faster than exhaustive searches, computational costs remain high for high dimensional datasets: both approaches multiply evaluation times by a factor of up to m^2, where m is the number of features. Once again, it is obvious that the use of these methods is prohibitive when dealing with datasets containing thousands of features.
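The greedy forward-selection loop described above can be sketched as follows (an illustrative Python sketch, not code from any of the cited programs; the `score` callback and the toy per-SNP weights are hypothetical stand-ins for a real subset evaluator such as cross-validated assignment accuracy):

```python
# Greedy forward selection: grow a subset one feature at a time, always adding
# the feature whose inclusion maximizes the subset score.
def forward_selection(features, score, k):
    """Select up to k features; `score` evaluates any candidate feature subset."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Try each remaining feature and keep the best-scoring addition.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy evaluator: the "informativeness" of a subset is the sum of per-SNP weights.
weights = {"snp1": 0.9, "snp2": 0.2, "snp3": 0.5}
top2 = forward_selection(weights, lambda subset: sum(weights[f] for f in subset), k=2)
print(top2)  # ['snp1', 'snp3']
```

Backward elimination is the mirror image: start from all features and, at each step, drop the feature whose removal hurts the score least.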
4.2 Informative marker selection
In population genetics, there are two analogous, fundamentally different approaches to performing an informative marker selection task [13]. The first includes all dedicated software developed especially for this task, which can in a sense be categorized as wrapper methods. These applications are WHICHLOCI [14], GAFS [15] and BELS [16], and will be discussed later. The second includes feature evaluation algorithms, which can be categorized as filter methods, but are specialized for population genomic data and cannot be used for general-purpose feature evaluation and selection.
WHICHLOCI [14], proposed in 2003, was the first application for the selection of informative markers, based roughly on the wrapper philosophy. It is composed of two procedures. The first ranks every locus according to its assignment success, i.e. the percentage of individuals correctly assigned to their population of origin. The second procedure calculates the assignment success of the most informative marker, then iteratively adds the next most informative marker and recalculates the assignment success. The procedure stops when a predefined accuracy is achieved; the whole process is identical to the aforementioned forward selection. The second approach, Genetic Algorithm Feature Selection (GAFS) [15], uses a genetic algorithm: an optimization technique originating from the field of artificial intelligence that mimics the process of natural selection [17]. In this case, the algorithm searches for the best candidate subset of loci by estimating the classification accuracy of the corresponding assignment test, which serves as the search objective criterion. GAFS can exhaustively search for the best loci combination, although the computational costs are extremely high. Finally, in 2008 Bromaghin proposed BELS (Backwards Elimination Locus Selection) [16], which operates using the backward elimination procedure previously described: the algorithm discards one attribute at each step, until only one locus is left or the level of accuracy reaches a user-defined minimum.
Unfortunately, all those applications suffer from certain drawbacks. Firstly, as reported extensively in [18], they have been implemented in a way that leads to a systematic upward bias in the predicted accuracy of the selected subset of markers for individual assignment into groups of origin. The reason is that they use the same set of individuals both to train the model and, later, to estimate its accuracy. The other drawbacks are related to the ability of those programs to handle and analyze high dimensional datasets. As mentioned before, those applications have been proposed, tested and extensively used with microsatellite datasets, whose dimensionality is incomparably lower than that of SNP datasets. An exhaustive search of papers that have recently used these software programs [19, 20, 21, etc.] shows that the number of markers they handled was no more than 100-200. It has been recently reported [13] that those applications were unable to load a 57K SNP dataset with 446 individuals, which is a medium-sized SNP dataset. A final drawback is their high computational cost; methods that search for the best combination in the feature set can be extremely computationally intensive.
The alternative way to perform such an analysis is through dedicated genetic filter methods, which rank SNPs according to their informativeness, i.e. the marker information content: the amount of information that a locus holds regarding the ancestry of an individual [22]. Until now, several measures have been proposed, such as Delta [23], Pairwise Wright's (1951) FST [24], Global Wright's (1951) FST [24], Pairwise Weir & Cockerham's (1984) FST [25], Global Weir & Cockerham's (1984) FST [25], Informativeness for Assignment (In) [22] and even Principal Components Analysis [26].

Although those methods are easy to implement, biologists do not always have the necessary informatics skills to implement them efficiently, or to handle and manipulate SNP datasets. A recently published solution is TRES (Toolbox for Ranking and Evaluation of SNPs) [13], a collection of algorithms built into a user-friendly and computationally efficient software package that can manipulate and analyze datasets even in the magnitude of millions of genotypes within a matter of seconds. TRES offers two categories of algorithms. The first comprises the most commonly used, established genetic filter methods, producing a desired set of a predefined number of top-ranked loci. The second comprises all the essential dataset manipulation algorithms needed to perform a complete analysis, from the initial pre-processing to the final evaluation step. These include dataset converters, which convert the initial data into different file formats; dataset splitters, for separation into user-defined percentages of train and test sets; and, finally, algorithms for constructing datasets that contain the SNPs selected through a SNP selection analysis. Those datasets can be used for the final evaluation step in dedicated software such as GENECLASS2.0 [27].
Comparisons of the different metrics have been published many times, and the results are at times contradictory. In general, no method outperforms all others in all cases, and the differences between metrics are marginal. This is probably related to the examined species, the levels of genetic heterogeneity among studied populations, the pool of samples considered and the desired stringency of the assignment [5, 28]. For example, Wilkinson et al. [5] concluded that Pairwise Wright's FST was the most successful method, whereas Ding et al. [28] proposed In as the best evaluator. Another interesting observation from [5] was that pairwise metrics performed better than global metrics (e.g. Global Wright's FST vs. Pairwise Wright's FST). Another study [29] compared metrics and programs on a SNP dataset for sockeye salmon and concluded that FST and In performed better than BELS and WHICHLOCI; obviously, the comparison was made with a fairly small SNP dataset, which was manageable for both programs. Finally, an early attempt to compare genetic filter methods with traditional data mining filter methods, such as Information Gain, revealed that the genetic methods performed better [30].
5. Methods
In this section, we present all necessary terms, equations and methods used in our experiments.

5.1 Frequent Itemsets
The term "frequent itemset" was proposed in the context of association rule mining by Agrawal in 1993 [31, 32]. Association rule mining was first introduced as a market basket analysis tool, although today it has become one of the most valuable tools for performing unsupervised exploratory data analysis in a wide range of research and commercial areas, including biology and bioinformatics. Some of the best-known applications in biology and bioinformatics are biological sequence analysis, analysis of gene expression data and others. A thorough review of discovering frequent patterns and association rules from biological data, including algorithms and applications, can be found in [33].

The discovery of frequent itemsets is often viewed as the discovery of association rules, although the former is in fact a fundamental part of association rule mining [34]. More specifically, the discovery of frequent itemsets constitutes the discovery of frequent co-occurrences of items in transactional databases. Such co-occurrences may imply relationships, possibly hidden and unknown, between frequent items. The process of mining association rules involves two steps. The first is the discovery of all frequent itemsets contained in a transaction database. In the second step, the association rules are generated from the discovered frequent itemsets. In the following paragraph, a more formal description of frequent itemsets is presented [35, 36]:

Let I = {i1, i2, ..., iN} be a finite set of binary attributes, which are called items, and let D be a finite multiset of transactions, which is called the dataset. Each transaction T ∈ D is a set of items such that T ⊆ I. A set of items is usually called an itemset; the length or size of an itemset is the number of items it contains. A transaction T ∈ D is said to contain an itemset X ⊆ I if X ⊆ T. The support of itemset X is defined as the fraction of the transactions that contain itemset X over the total number of transactions in D:

suppD(X) = |{T ∈ D | T ⊇ X}| / |D|    (1)

Given a minimum support threshold σ ∈ (0,1], an itemset X is said to be σ-frequent, or simply frequent in D, if suppD(X) ≥ σ.
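Equation (1) can be illustrated with a few lines of code (a minimal Python sketch of itemset support, not part of the FIFS implementation; the toy transactions are hypothetical):

```python
# Support of an itemset X in a transactional dataset D:
# supp_D(X) = |{T in D : T contains every item of X}| / |D|.
def support(dataset, itemset):
    """Fraction of transactions in `dataset` containing all items of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in dataset if itemset <= set(t)) / len(dataset)

D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
print(support(D, {"a", "b"}))  # 0.5 -> {a, b} occurs in 2 of 4 transactions
# With sigma = 0.5, the itemset {a, b} is sigma-frequent in D.
```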
5.2 SNP Evaluation Methods

Based on the findings of Ding et al. [28] and Wilkinson et al. [5], we decided to use Pairwise Wright's FST [24] and Informativeness for Assignment [22] in our experimental process, as these metrics are probably the most informative, together with Delta [23], which is probably the most commonly used measure of marker informativeness.
5.2.1 Delta
For a biallelic marker, the Delta value is given by the following equation:

Delta = |pAi − pAj|    (2)

where pAi is the frequency of allele A in the ith population and pAj is the frequency of the same allele (A) in the jth population. The Delta value is calculated only between two populations, so in the case of more than two populations it is computed as the average Delta value over all possible pairs of populations.
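The pairwise-averaging convention just described can be sketched as follows (an illustrative Python sketch assuming the absolute-difference form of Eq. (2); the example frequencies are hypothetical):

```python
# Delta for a biallelic marker: |pA_i - pA_j| per population pair, averaged over
# all pairs when more than two populations are present.
from itertools import combinations

def delta(freqs):
    """freqs: frequency of allele A in each population (one value per population)."""
    pairs = list(combinations(freqs, 2))
    return sum(abs(p_i - p_j) for p_i, p_j in pairs) / len(pairs)

print(round(delta([0.9, 0.1]), 3))       # 0.8  (single pair)
print(round(delta([0.9, 0.5, 0.1]), 3))  # mean of 0.4, 0.8 and 0.4
```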
5.2.2 Pairwise Wright’s FST
Pairwise Wright's FST for more than two populations is computed with the same pairwise-averaging approach as outlined for Delta. For a biallelic marker, the FST value is given by:

FST = (Ht − Hs) / Ht

where Hs is the average expected heterozygosity across subpopulations and Ht is the expected heterozygosity of the total population [37]. Hs and Ht are given by the following equations:

Hs = (2 ∗ pAi ∗ pBi + 2 ∗ pAj ∗ pBj) / 2    (3)

Ht = 2 ∗ pA ∗ pB    (4)

where pAi is the frequency of allele A in the ith population, pAj is the frequency of the same allele in the jth population, and pA (without subscript) is the frequency of allele A across all populations. Notations for allele B are defined similarly.
5.2.3 Informativeness for Assignment (In)
In is a mutual information-based statistic; more specifically, it takes into account self-reported ancestry information from the sampled individuals [23, 26]. It is given by the following equation:

In = Σj=1..2 [ −pj log pj + Σi=1..K (pij log pij) / K ]    (5)

where i = 1, 2, ..., K indexes the populations, with K ≥ 2, and the statistic is computed at each of the N loci. pij denotes the frequency of allele j in population i, and pj denotes the average frequency of allele j over the K populations. In all aforementioned equations, allele frequencies are calculated as in [26].
5.3 Dataset

The experimental analysis was conducted on the dataset of Wilkinson et al. (2012). This dataset provides comprehensive coverage of pig breed types present in Britain; it consists of 446 pigs almost equally distributed across 14 populations (7 traditional British breeds, 5 commercial purebreds, one imported European breed and one imported Asian breed) that were genotyped with the PorcineSNP60 BeadChip (59,436 SNPs) [38]. Table 1 shows the exact distribution of the instances throughout the 14 populations and their split into train and test datasets.

TABLE 1: Number of Instances for each Population

Populations                          # instances in Train Dataset   # instances in Test Dataset
Berkshire (BK)                       51                             22
British Saddleback (BS)              21                             9
Duroc (DU)                           21                             10
Gloucestershire Old Spots (GLOS)     16                             8
Hampshire (HA)                       21                             9
Landrace (LR)                        21                             9
Large Black (LB)                     21                             9
Large White (LW)                     23                             11
Mangalica (MA)                       18                             8
Meishan (MS)                         16                             8
Middle White (MW)                    21                             9
Pietrain (PI)                        14                             7
Tamworth (TA)                        21                             10
Welsh (W)                            23                             9
6. FIFS – Frequent Item Feature Selection

In this paper, we propose an informative marker selection method called FIFS (Frequent Item Feature Selection). The notion behind this method is to identify SNPs that characterize certain populations. FIFS is a modular method consisting of two main components. The first, called Population Specific Most Frequent Genotype Identification, is inspired by frequent itemset theory; this module searches the feature space for those SNPs that present almost unique genotypes for a certain population/class. The second, called Informative Marker Subset Construction, is a component that constructs the subsets of SNPs to be returned as the most informative. This final step is implemented in four different ways. Figure 2 depicts the whole feature selection process.
Figure 2. The Data Mining Method FIFS

6.1 Population Specific Most Frequent Genotype Identification
The first part of the algorithm aims to search the feature space and retain only those SNPs
372
that contain a frequent value / genotype for only one population. This is achieved using the
373
following steps:
374
3.1.1 Monomorphic SNP Elimination
375
It is very common for SNP datasets to contain monomorphic loci i.e. loci where all
376
individuals throughout the populations have the same genotype. In other words, these
377
features have the same value for every instance in the dataset. This step ensures that all
378
monomorphic SNPs are eliminated from the analysis, since they exhibit no discriminative
379
power among populations.
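This elimination step can be sketched as follows (an illustrative Python fragment, not the authors' implementation; the dataset is assumed to be held as a mapping from SNP name to the genotype of each individual, and the SNP names are made up):

```python
def drop_monomorphic(genotypes):
    """Remove SNPs for which every individual carries the same genotype.

    genotypes: dict mapping SNP name -> list of genotype strings, one per individual.
    """
    return {snp: values for snp, values in genotypes.items() if len(set(values)) > 1}

# SNP2 is monomorphic (all individuals are "CC"), so it is eliminated.
data = {"SNP1": ["AA", "AT", "TT"], "SNP2": ["CC", "CC", "CC"]}
print(sorted(drop_monomorphic(data)))  # ['SNP1']
```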
6.1.2 Genetic to Transactional Data Conversion

The second step of the algorithm aims to convert the genetic data to transactional data. More precisely, all nominal attributes are converted to binary, so that each initial feature is represented in the transactional database by as many features as the number of the initial feature's values. For instance (Figure 3), SNP1 with occurring values {AA, AT, TT} is transformed to SNP1_AA, SNP1_AT and SNP1_TT, each of them taking values in {1, 0} or {true, false}, which encode the presence or absence of the corresponding SNP value in the examined instance. In this particular example, the individual's genotype at SNP1 is AT in the genetic dataset. In the transactional database, SNP1 will be converted to the values "0" for SNP1_AA, "1" for SNP1_AT and "0" for SNP1_TT. In other words, the transactional data are no longer the SNPs, but the SNP genotypes' presence/absence values.

Figure 3. Conversion of genetic data to transactional dataset.
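The conversion illustrated in Figure 3 can be sketched as follows (an illustrative Python fragment, not the authors' implementation; one binary feature is created per genotype observed at each SNP):

```python
def to_transactional(genotypes):
    """Convert nominal SNP features to binary presence/absence features.

    genotypes: dict mapping SNP name -> list of genotype strings, one per individual.
    Returns a dict mapping binary feature names such as "SNP1_AT" -> list of 0/1 flags.
    """
    transactional = {}
    for snp, values in genotypes.items():
        for genotype in sorted(set(values)):  # one binary feature per occurring genotype
            transactional[snp + "_" + genotype] = [int(v == genotype) for v in values]
    return transactional

# The first individual's genotype at SNP1 is AT, so SNP1_AT is 1 and the rest are 0.
t = to_transactional({"SNP1": ["AT", "AA", "TT"]})
print(t["SNP1_AT"])  # [1, 0, 0]
```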
6.1.3 Split Data in Single-Population Datasets

As mentioned before, the purpose of the "Population Specific Most Frequent Genotype Identification" part is to find the frequent genotypes in each population. In this step, the algorithm splits the newly created transactional dataset into N different datasets, where N is the number of populations of origin. Each newly created dataset contains the individuals/instances of only one population.
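This split can be sketched as follows (an illustrative Python fragment; the population labels are assumed to be available as a list parallel to the instances, and the breed codes are taken from Table 1):

```python
from collections import defaultdict

def split_by_population(instances, labels):
    """Group transactional instances into one dataset per population of origin."""
    per_population = defaultdict(list)
    for record, population in zip(instances, labels):
        per_population[population].append(record)
    return dict(per_population)

rows = [{"SNP1_AT": 1}, {"SNP1_AA": 1}, {"SNP1_AT": 1}]
print(sorted(split_by_population(rows, ["DU", "PI", "DU"])))  # ['DU', 'PI']
```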
6.1.4 Frequent Genotype Selection

In this step, the algorithm calculates the support of each item/feature in each dataset. For each dataset, which corresponds to a specific population, the algorithm creates a set containing all σ-frequent items, i.e. those features with support greater than the user-defined threshold σ.
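The σ-frequent selection can be sketched as follows (an illustrative Python fragment, not the authors' implementation; support is taken here as the fraction of a population's instances in which the genotype is present, following the usual frequent-itemset definition):

```python
def sigma_frequent(transactional, sigma):
    """Return the features whose support in ONE population exceeds sigma.

    transactional: dict mapping binary feature name -> list of 0/1 flags.
    """
    frequent = set()
    for feature, flags in transactional.items():
        support = sum(flags) / len(flags)  # fraction of instances carrying the genotype
        if support > sigma:
            frequent.add(feature)
    return frequent

# SNP1_AT has support 0.75 > 0.7; SNP1_AA has support 0.25 and is not selected.
population = {"SNP1_AT": [1, 1, 1, 0], "SNP1_AA": [0, 0, 0, 1]}
print(sorted(sigma_frequent(population, 0.7)))  # ['SNP1_AT']
```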
6.1.5 Discarding Common Elements

It is essential for the algorithm to retain only features that are uniquely most frequent in each population. After the σ-frequent set construction for all populations, the algorithm performs pairwise comparisons between the sets, in order to eliminate all co-occurrences. Finally, every feature appearing in a set exists only in that population's SNP set.
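The pairwise elimination can be sketched as follows (an illustrative Python fragment; a feature survives only if it is σ-frequent in exactly one population):

```python
def discard_common(frequent_sets):
    """Keep, for each population, only features frequent in no other population.

    frequent_sets: dict mapping population name -> set of sigma-frequent features.
    """
    unique_sets = {}
    for pop, features in frequent_sets.items():
        others = set().union(*(s for p, s in frequent_sets.items() if p != pop))
        unique_sets[pop] = features - others
    return unique_sets

# SNP2_CC is frequent in both populations and is therefore discarded from both.
sets = {"DU": {"SNP1_AT", "SNP2_CC"}, "PI": {"SNP2_CC", "SNP3_GG"}}
print({p: sorted(f) for p, f in discard_common(sets).items()})
# {'DU': ['SNP1_AT'], 'PI': ['SNP3_GG']}
```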
6.1.6 Mapping Back to the Genetic Dataset

Up to this step the algorithm operates on the converted transactional data. To facilitate interpretation of the results, it is crucial that this module maps the converted features of each population's set back to the original SNP names.

Summarizing, this first module receives as input a SNP dataset and returns N sets containing the σ-frequent genotypes, one for each population.
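The mapping can be sketched as follows (an illustrative Python fragment; it assumes the binary features were named by suffixing the genotype, as in the SNP1_AT example of Figure 3):

```python
def map_back(unique_sets):
    """Recover original SNP names from binary feature names such as "SNP1_AT"."""
    return {pop: {feature.rsplit("_", 1)[0] for feature in features}
            for pop, features in unique_sets.items()}

print(sorted(map_back({"DU": {"SNP1_AT", "SNP7_GG"}})["DU"]))  # ['SNP1', 'SNP7']
```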
6.2 Informative Marker Subset Construction

The output of the first module is N sets of SNPs which have σ-frequent genotypes only for a certain population, where N is the number of populations. The purpose of the second module is to select from those sets a specific number of SNPs for each population. This module is implemented in four different ways (discussed below). In all methods, the algorithm uses an integer number C, which is the number of features used to "describe" each population. The value of C is the same across populations. It is important to mention that although the total number of selected features we expect to retrieve is the product of C and the number of populations, this product represents an upper bound on the final feature count. This is because two binary features, selected from two different populations, may map to the same initial feature (SNP), as they might represent two different values of the same feature (SNP). For example, one population might be characterized by the presence of genotype AA at a specific SNP, whereas another by the presence of genotype TT at the same SNP. The four different ways used to implement the module are as follows:
FIFS_RS: The algorithm performs selection via random sampling without replacement of C SNPs from each population's unique genotype subset.

FIFS_In: This method ranks the SNPs separately within each population's unique genotype subset and generates N ranked lists. The algorithm selects the top-C SNPs from each ranked list/population.

FIFS_RSRN: This is a variation of the FIFS_RS method, inspired by wrapper methods. FIFS_RSRN performs random sampling without replacement X times, and returns X different subsets of informative SNPs. Subsequently, the algorithm calculates the assignment accuracy of each subset and finally retains only the subset with the best performance. For the assignment test we used the following equations [40]:
\[
\frac{2\,(n_{ijk} + 1/k_j)\,(n_{ijk'} + 1/k_j)}{(n_{ij} + 1)(n_{ij} + 2)} \quad \text{if } k \neq k' \qquad (6)
\]

\[
\frac{(n_{ijk} + 1/k_j + 1)\,(n_{ijk} + 1/k_j)}{(n_{ij} + 1)(n_{ij} + 2)} \quad \text{if } k = k' \qquad (7)
\]
where n_ijk is the number of alleles k sampled at locus j in population i (not counting the individual to be assigned), n_ij is the number of gene copies sampled at locus j in population i, i.e. n_ij = Σ_k n_ijk,
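Equations (6) and (7) can be sketched in code as follows (an illustrative Python fragment, not the authors' implementation; it assumes k_j denotes the number of distinct alleles at locus j, following the Bayesian assignment formulation of [40]):

```python
def genotype_probability(counts, n_ij, k_j, a, b):
    """Probability of observing genotype (a, b) at one locus in a population,
    per equations (6) (heterozygote) and (7) (homozygote).

    counts: dict mapping allele -> n_ijk, the count of that allele in the
            population sample (excluding the individual to be assigned).
    n_ij:   total number of gene copies sampled at the locus.
    k_j:    number of distinct alleles at the locus.
    """
    prior = 1.0 / k_j
    denom = (n_ij + 1) * (n_ij + 2)
    n_a, n_b = counts.get(a, 0), counts.get(b, 0)
    if a != b:  # equation (6)
        return 2 * (n_a + prior) * (n_b + prior) / denom
    return (n_a + prior + 1) * (n_a + prior) / denom  # equation (7)

# Locus with alleles A and T sampled 6 and 4 times (n_ij = 10, k_j = 2):
print(round(genotype_probability({"A": 6, "T": 4}, 10, 2, "A", "T"), 4))  # 0.4432
```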