Accepted Manuscript

FIFS: A data mining method for informative marker selection in high dimensional population genomic data

Ioannis Kavakiotis, Patroklos Samaras, Alexandros Triantafyllidis, Ioannis Vlahavas

PII: S0010-4825(17)30318-9
DOI: 10.1016/j.compbiomed.2017.09.020
Reference: CBM 2791
To appear in: Computers in Biology and Medicine
Received Date: 2 August 2017
Revised Date: 29 August 2017
Accepted Date: 26 September 2017

Please cite this article as: I. Kavakiotis, P. Samaras, A. Triantafyllidis, I. Vlahavas, FIFS: A data mining method for informative marker selection in high dimensional population genomic data, Computers in Biology and Medicine (2017), doi: 10.1016/j.compbiomed.2017.09.020.
FIFS: A Data Mining Method for Informative Marker Selection in High Dimensional Population Genomic Data
Ioannis Kavakiotis (a,b), Patroklos Samaras (a), Alexandros Triantafyllidis (b), Ioannis Vlahavas (a)

(a) School of Informatics, Aristotle University of Thessaloniki, 54124, Greece
(b) Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, 54124, Greece

Correspondence: Ioannis Kavakiotis
Email: [email protected]
Address: School of Informatics, Aristotle University of Thessaloniki, 54124, Greece
Tel: +30-2310-0998145
Fax: +30-231-0998362
Abstract
Background and Objective

Single Nucleotide Polymorphisms (SNPs) are nowadays becoming the marker of choice for biological analyses involving a wide range of applications with great medical, biological, economic and environmental interest. Classification tasks, i.e. the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields such as forensic investigations, discrimination between wild and/or farmed populations, and others. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification tasks, especially for Single Nucleotide Polymorphism (SNP) datasets, where the number of features can amount to hundreds of thousands or millions.

Methods

In this paper, we present a novel data mining approach, called FIFS (Frequent Item Feature Selection), based on the use of frequent items for the selection of the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them, in order to create the informative SNP subsets to be returned.

Results

The proposed method (FIFS) was tested on a real dataset providing comprehensive coverage of pig breed types present in Britain. This dataset consisted of 446 individuals divided into 14 sub-populations, genotyped at 59,436 SNPs. Our method outperforms the state-of-the-art and baseline methods in every case. More specifically, our method surpassed the assignment accuracy threshold of 95% needing only half the number of SNPs selected by other methods (FIFS: 28 SNPs; Delta: 70 SNPs; Pairwise FST: 70 SNPs; In: 100 SNPs).
Conclusion

Our approach successfully deals with the problem of informative marker selection in high dimensional genomic datasets. It offers better results compared to existing approaches and can aid biologists in selecting the most informative markers with maximum discrimination power for the optimization of cost-effective panels, with applications related to e.g. species identification, wildlife management and forensics.
Keywords

Bioinformatics, Machine Learning, Data Mining, Feature Selection, Frequent Pattern Mining, Single Nucleotide Polymorphism, Population Genomics, Ancestry Informative Marker, Big Data

1. Introduction
Significant advances in biotechnology, and more specifically in high-throughput technologies, have made big data production feasible even for small laboratories, and have enabled researchers to become big data users by accessing public repositories such as EBI [1] or the international HapMap consortium [2]. Terabyte-sized datasets are now common in sciences including biology [3], and new services should be developed to facilitate their utilization. It is evident that machine learning, data mining and knowledge discovery in biological data will play an important role in upcoming scientific discoveries.

Data mining and machine learning constitute computer science subfields strongly related to artificial intelligence, statistics, mathematics and database technology. These subfields offer sophisticated techniques, producing robust and accurate results when applied to a wide range of sciences. Nowadays, the continuous progress of high-throughput sequencing technologies has enabled the genotyping of thousands or even millions of SNPs (Single Nucleotide Polymorphisms) and consequently the production of very large, high-dimensional SNP datasets for model as well as non-model organisms.
SNPs are rapidly becoming the marker of choice for a wide range of organisms and biological analyses [4], and SNP data analysis is gaining popularity in various applications with great medical, biological and economic interest, such as the identification of disease-related mutations, Quantitative Trait Locus (QTL) analyses, food traceability, brand authentication, discrimination between wild and/or farmed populations and anthropological forensic investigations [5].
Several of the above-mentioned applications require the assignment of individuals, or groups of individuals, to their population of origin, based on their (multi-locus) genotypes [6]. This is a classification task, performed with specialized classifiers for genomic data. As has been reported many times [5, 7, 8], it is important to create minimal panels (i.e. datasets) with maximum discrimination power, i.e. to reduce dataset dimensionality. To achieve this, researchers should select, amongst all genotyped loci (SNPs), those SNPs that can best discriminate the analyzed population samples. In terms of machine learning and data mining this is a feature selection task, while for population genetics it is an informative marker selection task.
Informative marker selection is a fairly classic task in population genetics, and in previous decades it was mainly performed with microsatellites. The production of SNP datasets with incomparably higher dimensionality has revealed many drawbacks of the existing approaches, as well as new opportunities, to be discussed later. Still, no current method can be broadly accepted as the most successful, because none outperforms the others in all circumstances.
The present work proposes a novel data mining approach, called FIFS (Frequent Item Feature Selection), based on the notion of frequent items for selecting the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them, in order to create the informative SNP subsets to be returned.
FIFS significantly improves on the performance of existing methods, achieving better assignment accuracy in every comparison. Moreover, its modular nature facilitates the reuse of many intermediate results, making the whole approach extremely efficient. FIFS is implemented in Java and can therefore be executed on all operating systems. The application and the user manual are available on the application's website: http://intelligence.csd.auth.gr/bioinformatics/fifs/
This paper is organized as follows. The motivation behind this work is described in section 2. Section 3 presents the biological background knowledge concerning SNPs and the respective datasets. Section 4 presents related work and existing approaches for the informative marker selection task. In section 5, all necessary terms, equations and methods used in our experimental setup are presented. Section 6 is dedicated to the detailed description of our approach. Later on, the experimental process is described in section 7. The results of the experiments conducted to compare our new method with the state-of-the-art and baseline methods are presented in section 8, and the paper is concluded in section 9.
2. Motivation

The importance of feature selection for SNP datasets is beyond dispute, in both biology and machine learning. From a biological point of view, the importance of feature selection, i.e. selecting those SNPs with the maximum information power, has been stressed in several scientific projects and papers [5, 7, 8] and is principally economic. Although valuable, full genome-wide data are costly to produce. Smaller panels (i.e. datasets) are faster, cheaper and more flexible, and thus facilitate the genotyping of several hundreds of individuals at an incomparably lower cost than genome-wide genotyping.

From a data mining perspective, feature selection has multiple beneficial effects on prediction algorithms. Firstly, it can improve prediction performance by defying the curse of dimensionality. Moreover, it improves computational performance and facilitates data understanding through easier data visualization [9]. In the case of large SNP datasets, a feature selection process is therefore essential.

However, the identification of these (minimal) SNP sets has lately become problematic, due to the dimensionality of SNP datasets. The problems are mainly associated with computer science issues, and more specifically with the handling, processing and analysis of big data. Existing prominent applications have various computational limitations: all of them have been used to analyze microsatellite datasets, of incomparably lower dimensionality, containing up to 100 features/loci. SNP datasets, on the other hand, can contain hundreds of thousands of loci in animal/plant organisms, or even millions in human datasets. The analysis of these datasets on a desktop computer may be prohibitively expensive or even impossible. All related problems and issues are discussed later on and are also thoroughly reviewed in [4].

Furthermore, considering that SNP selection is a task to be performed mainly by biologists, possessing few or no programming skills, it is crucial that proposed methods are not computationally intensive and can be run, if possible, on a desktop computer.
3. Background Knowledge
3.1 Single Nucleotide Polymorphisms – SNPs

A single nucleotide polymorphism is a type of genetic variation that occurs when a single nucleotide (A, T, G or C) in the genome differs between members of a biological species or between paired chromosomes. For example, figure 1 depicts three genotyped DNA sequences, which differ in one nucleotide at the sixth and eleventh positions. Such mutations can, in some cases, be harmful and cause various diseases (such as cancer or diabetes), or be harmless, especially when they are located in non-coding regions, i.e. not in genes. The best-known SNP is the common mutation causing sickle cell anaemia.
Fig. 1. Three sequences containing two SNPs. Each SNP presents two alleles.
SNPs can also affect how organisms respond to drugs, vaccines and pathogens. An allele is one of the possible alternative forms of the same gene or genetic locus. In SNP studies, the existence of two alleles is the most common case. A marker possessing only two alleles is called biallelic.

3.2 Population Genomic Datasets – SNP Datasets
SNP datasets can be found in many file formats; PED files [10], HapMap files [2], Variant Call Format (VCF) and GENEPOP [11] are some of the most commonly used. The dimensionality of SNP datasets can vary a lot: animal/plant datasets can usually reach a hundred thousand attributes (SNPs), whereas human datasets can exceed a million SNPs. In SNP datasets, each attribute is a biallelic marker, i.e. it corresponds to the variation between two nucleotides. Since most plant and animal organisms are diploid, receiving one set of chromosomes from a male and one from a female parent, the genotype of one individual (i.e. the combination of the two parental alleles) can have at most three values, arising from the combination of two nucleotides: two homozygous and one heterozygous. For instance, one SNP can present the values AA, GG and AG (GA and AG are essentially the same). The term allele frequency refers to the frequency of each allele at a specific SNP in a population. For instance, consider a population containing five individuals. Assume a SNP with two alleles (T and C), corresponding to three possible genotypic combinations (TT, CC, TC). If three individuals have TT genotypes, one CC and one TC, then the allele frequencies are 0.7 for the T allele and 0.3 for the C allele.
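The worked example above can be reproduced with a few lines of code (an illustrative Python sketch, not part of the FIFS implementation, which is in Java): each diploid individual contributes two chromosomes, so allele frequencies are counted over 2N chromosomes.

```python
# Allele frequencies at one biallelic SNP, from the genotypes of N diploid
# individuals (each genotype string contributes two alleles).
from collections import Counter

def allele_frequencies(genotypes):
    """Return the frequency of each allele over the 2N sampled chromosomes."""
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# The five-individual example from the text: three TT, one CC, one TC.
freqs = allele_frequencies(["TT", "TT", "TT", "CC", "TC"])
print(freqs)  # {'T': 0.7, 'C': 0.3}
```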
4. Related Work
In this section, we present the necessary background knowledge for the feature selection task and link it to the available population genetics methods for selecting informative markers.

4.1 Feature Selection
Feature selection methods are divided into two major categories. The first category comprises "filter" methods, which evaluate attributes based on general data characteristics. The second category contains "wrapper" methods, which use machine learning algorithms to evaluate and finally decide which is the most appropriate candidate subset of features [12]. The main advantage of wrappers is that they commonly (though not in population genetics, as stated later) offer better classification accuracy. Their main disadvantage is that the algorithm has to build a model many times in order to evaluate different subsets of features, which in some cases, such as with Support Vector Machines (SVMs), is computationally expensive. This is an important drawback especially for SNP datasets, whose dimensionality can be extremely high. On the other hand, the main advantage of filters is that they are much faster than wrapper methods, and probably the only available solution when dealing with big data.

In the case of wrappers, the key task of the whole selection process, which determines the computational performance of the whole method, is the strategy for searching the attribute space in order to select the subset with the best classification accuracy. The number of possible attribute subsets increases exponentially with the number of attributes: for an exhaustive search of the feature space, up to 2^m possible subsets have to be examined, where m is the number of attributes. Thus, it is obvious that exhaustive search can only be used on datasets containing few features.
Due to the computational costs of exhaustive searches, greedy approaches are often adopted to search the attribute space. The two standard strategies are opposite in direction but operate in a completely analogous way: at each step, the attribute subset changes by adding or deleting a single attribute. More specifically, the first approach, called forward selection, begins with an empty attribute subset. The aim, at each step, is to select the best attribute to add to the existing subset. To achieve this, every remaining attribute is added to the subset, one at a time, and the performance of the resulting subset is computed. When all attributes have been examined, the attribute whose addition gave the best performance is added to the existing subset.

The second approach, called backward elimination, starts with the whole set of attributes and removes one attribute at each step, in a way analogous to forward selection: the attribute whose removal still yields the greatest performance is excluded from the subset. These two approaches guarantee a locally, but not necessarily globally, optimal set of attributes [12]. Although they are much faster than exhaustive searches, computational costs remain high for high dimensional datasets: both approaches multiply evaluation times by a factor of up to m^2, where m is the number of features. Once again, it is obvious that the use of these methods is prohibitive when dealing with datasets containing thousands of features.
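The greedy forward-selection loop described above can be sketched as follows (an illustrative Python sketch, not code from any of the cited programs; the `score` callback and the toy per-SNP weights are hypothetical stand-ins for a real subset evaluator such as cross-validated assignment accuracy):

```python
# Greedy forward selection: grow a subset one feature at a time, always adding
# the feature whose inclusion maximizes the subset score.
def forward_selection(features, score, k):
    """Select up to k features; `score` evaluates any candidate feature subset."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Try each remaining feature and keep the best-scoring addition.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy evaluator: the "informativeness" of a subset is the sum of per-SNP weights.
weights = {"snp1": 0.9, "snp2": 0.2, "snp3": 0.5}
top2 = forward_selection(weights, lambda subset: sum(weights[f] for f in subset), k=2)
print(top2)  # ['snp1', 'snp3']
```

Backward elimination is the mirror image: start from all features and, at each step, drop the feature whose removal hurts the score least.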
4.2 Informative marker selection
In population genetics, there are two analogous, fundamentally different approaches to performing an informative marker selection task [13]. The first includes all dedicated software developed especially for this task, which can in a sense be categorized as wrapper methods. These applications are WHICHLOCI [14], GAFS [15] and BELS [16], and will be discussed later. The second includes feature evaluation algorithms, which can be categorized as filter methods, but are specialized for population genomic data and cannot be used for general-purpose feature evaluation and selection.
WHICHLOCI [14], proposed in 2003, was the first application for the selection of informative markers, based roughly on the wrapper philosophy. It is composed of two procedures. The first ranks every locus according to its assignment success, i.e. the percentage of individuals correctly assigned to their population of origin. The second procedure calculates the assignment success of the most informative marker, then iteratively adds the next most informative marker and recalculates the assignment success. The procedure stops when a predefined accuracy is achieved; the whole process is identical to the aforementioned forward selection. The second approach, Genetic Algorithm Feature Selection (GAFS) [15], uses a genetic algorithm: an optimization technique originating from the field of artificial intelligence that mimics the process of natural selection [17]. In this case, the algorithm searches for the best candidate subset of loci by estimating the classification accuracy of the corresponding assignment test, which serves as the search objective criterion. GAFS can exhaustively search for the best loci combination, although the computational costs are extremely high. Finally, in 2008 Bromaghin proposed BELS (Backwards Elimination Locus Selection) [16], which operates using the backward elimination procedure previously described: the algorithm discards one attribute at each step, until only one locus is left or the level of accuracy reaches a user-defined minimum.
Unfortunately, all those applications suffer from certain drawbacks. Firstly, as reported extensively in [18], they have been implemented in a way that leads to a systematic upward bias in the predicted accuracy of the selected subset of markers for individual assignment into groups of origin. The reason is that they use the same set of individuals both to train the model and, later, to estimate its accuracy. The other drawbacks are related to the ability of those programs to handle and analyze high dimensional datasets. As mentioned before, those applications have been proposed, tested and extensively used with microsatellite datasets, whose dimensionality is incomparably lower than that of SNP datasets. An exhaustive search of papers that have recently used these software programs [19, 20, 21, etc.] shows that the number of markers they handled was no more than 100-200. It has been recently reported [13] that those applications were unable to load a 57K SNP dataset with 446 individuals, which is a medium-sized SNP dataset. A final drawback is their high computational cost; methods that search for the best combination in the feature set can be extremely computationally intensive.
The alternative way to perform such an analysis is through dedicated genetic filter methods, which rank SNPs according to their informativeness, i.e. the marker information content: the amount of information that a locus holds regarding the ancestry of an individual [22]. Until now, several measures have been proposed, such as Delta [23], Pairwise Wright's (1951) FST [24], Global Wright's (1951) FST [24], Pairwise Weir & Cockerham's (1984) FST [25], Global Weir & Cockerham's (1984) FST [25], Informativeness for Assignment (In) [22] and even Principal Components Analysis [26].

Although those methods are easy to implement, biologists do not always have the necessary informatics skills to implement them efficiently, or to handle and manipulate SNP datasets. A recently published solution is TRES (Toolbox for Ranking and Evaluation of SNPs) [13], a collection of algorithms built into a user-friendly and computationally efficient software package that can manipulate and analyze datasets even in the magnitude of millions of genotypes within a matter of seconds. TRES offers two categories of algorithms. The first comprises the most commonly used, established genetic filter methods, producing a desired set of a predefined number of top-ranked loci. The second comprises all the essential dataset manipulation algorithms needed to perform a complete analysis, from the initial pre-processing to the final evaluation step. These include dataset converters, which convert the initial data into different file formats; dataset splitters, for separation into user-defined percentages of train and test sets; and, finally, algorithms for constructing datasets that contain the SNPs selected through a SNP selection analysis. Those datasets can be used for the final evaluation step in dedicated software such as GENECLASS2.0 [27].
Comparisons of the different metrics have been published many times, and the results are at times contradictory. In general, no method outperforms all others in all cases, and the differences between metrics are marginal. This is probably related to the examined species, the levels of genetic heterogeneity among studied populations, the pool of samples considered and the desired stringency of the assignment [5, 28]. For example, Wilkinson et al. [5] concluded that Pairwise Wright's FST was the most successful method, whereas Ding et al. [28] proposed In as the best evaluator. Another interesting observation from [5] was that pairwise metrics performed better than global metrics (e.g. Global Wright's FST vs. Pairwise Wright's FST). Another study [29] compared metrics and programs on a SNP dataset for sockeye salmon and concluded that FST and In performed better than BELS and WHICHLOCI; obviously, the comparison was made with a fairly small SNP dataset, which was manageable for both programs. Finally, an early attempt to compare genetic filter methods with traditional data mining filter methods, such as Information Gain, revealed that the genetic methods performed better [30].
5. Methods
In this section, we present all necessary terms, equations and methods used in our experiments.

5.1 Frequent Itemsets
The term "frequent itemset" was proposed in the context of association rule mining by Agrawal in 1993 [31, 32]. Association rule mining was first introduced as a market basket analysis tool, although today it has become one of the most valuable tools for performing unsupervised exploratory data analysis in a wide range of research and commercial areas, including biology and bioinformatics. Some of the best-known applications in biology and bioinformatics are biological sequence analysis, analysis of gene expression data and others. A thorough review of discovering frequent patterns and association rules from biological data, including algorithms and applications, can be found in [33].

The discovery of frequent itemsets is often viewed as the discovery of association rules, although the former is in fact a fundamental part of association rule mining [34]. More specifically, the discovery of frequent itemsets constitutes the discovery of frequent co-occurrences of items in transactional databases. Such co-occurrences may imply relationships, possibly hidden and unknown, between frequent items. The process of mining association rules involves two steps. The first is the discovery of all frequent itemsets contained in a transaction database. In the second step, the association rules are generated from the discovered frequent itemsets. In the following paragraph, a more formal description of frequent itemsets is presented [35, 36]:

Let I = {i1, i2, ..., iN} be a finite set of binary attributes, which are called items, and let D be a finite multiset of transactions, which is called the dataset. Each transaction T ∈ D is a set of items such that T ⊆ I. A set of items is usually called an itemset; the length or size of an itemset is the number of items it contains. A transaction T ∈ D is said to contain an itemset X ⊆ I if X ⊆ T. The support of itemset X is defined as the fraction of the transactions that contain itemset X over the total number of transactions in D:

suppD(X) = |{T ∈ D | T ⊇ X}| / |D|    (1)

Given a minimum support threshold σ ∈ (0,1], an itemset X is said to be σ-frequent, or simply frequent in D, if suppD(X) ≥ σ.
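Equation (1) can be illustrated with a few lines of code (a minimal Python sketch of itemset support, not part of the FIFS implementation; the toy transactions are hypothetical):

```python
# Support of an itemset X in a transactional dataset D:
# supp_D(X) = |{T in D : T contains every item of X}| / |D|.
def support(dataset, itemset):
    """Fraction of transactions in `dataset` containing all items of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in dataset if itemset <= set(t)) / len(dataset)

D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
print(support(D, {"a", "b"}))  # 0.5 -> {a, b} occurs in 2 of 4 transactions
# With sigma = 0.5, the itemset {a, b} is sigma-frequent in D.
```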
5.2 SNP Evaluation Methods

Based on the findings of Ding et al. [28] and Wilkinson et al. [5], we decided to use Pairwise Wright's FST [24] and Informativeness for Assignment [22] in our experimental process, as these metrics are probably the most informative, together with Delta [23], which is probably the most commonly used measure of marker informativeness.
5.2.1 Delta
For a biallelic marker, the Delta value is given by the following equation:

Delta = |pAi − pAj|    (2)

where pAi is the frequency of allele A in the ith population and pAj is the frequency of the same allele (A) in the jth population. The Delta value is calculated only between two populations, so in the case of more than two populations it is computed as the average Delta value over all possible pairs of populations.
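The pairwise-averaging convention just described can be sketched as follows (an illustrative Python sketch assuming the absolute-difference form of Eq. (2); the example frequencies are hypothetical):

```python
# Delta for a biallelic marker: |pA_i - pA_j| per population pair, averaged over
# all pairs when more than two populations are present.
from itertools import combinations

def delta(freqs):
    """freqs: frequency of allele A in each population (one value per population)."""
    pairs = list(combinations(freqs, 2))
    return sum(abs(p_i - p_j) for p_i, p_j in pairs) / len(pairs)

print(round(delta([0.9, 0.1]), 3))       # 0.8  (single pair)
print(round(delta([0.9, 0.5, 0.1]), 3))  # mean of 0.4, 0.8 and 0.4
```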
5.2.2 Pairwise Wright’s FST
Pairwise Wright's FST for more than two populations is computed with the same pairwise-averaging approach as outlined for Delta. For a biallelic marker, the FST value is given by:

FST = (Ht − Hs) / Ht

where Hs is the average expected heterozygosity across subpopulations and Ht is the expected heterozygosity of the total population [37]. Hs and Ht are given by the following equations:

Hs = (2 ∗ pAi ∗ pBi + 2 ∗ pAj ∗ pBj) / 2    (3)

Ht = 2 ∗ pA ∗ pB    (4)

where pAi is the frequency of allele A in the ith population, pAj is the frequency of the same allele in the jth population, and pA (without subscript) is the frequency of allele A across all populations. Notations for allele B are defined similarly.
5.2.3 Informativeness for Assignment (In)
In is a mutual information-based statistic; more specifically, it takes into account self-reported ancestry information from the sampled individuals [23, 26]. It is given by the following equation:

In = Σj=1..2 [ −pj log pj + Σi=1..K (pij log pij) / K ]    (5)

where i = 1, 2, ..., K indexes the populations, with K ≥ 2, and the statistic is computed at each of the N loci. pij denotes the frequency of allele j in population i, and pj denotes the average frequency of allele j over the K populations. In all aforementioned equations, allele frequencies are calculated as in [26].
5.3 Dataset

The experimental analysis was conducted on the dataset of Wilkinson et al. (2012). This dataset provides comprehensive coverage of pig breed types present in Britain; it consists of 446 pigs almost equally distributed across 14 populations (7 traditional British breeds, 5 commercial purebreds, one imported European breed and one imported Asian breed) that were genotyped with the PorcineSNP60 BeadChip (59,436 SNPs) [38]. Table 1 shows the exact distribution of the instances throughout the 14 populations and their split into train and test datasets.

TABLE 1: Number of Instances for each Population

Populations                          # instances in Train Dataset   # instances in Test Dataset
Berkshire (BK)                       51                             22
British Saddleback (BS)              21                             9
Duroc (DU)                           21                             10
Gloucestershire Old Spots (GLOS)     16                             8
Hampshire (HA)                       21                             9
Landrace (LR)                        21                             9
Large Black (LB)                     21                             9
Large White (LW)                     23                             11
Mangalica (MA)                       18                             8
Meishan (MS)                         16                             8
Middle White (MW)                    21                             9
Pietrain (PI)                        14                             7
Tamworth (TA)                        21                             10
Welsh (W)                            23                             9
6. FIFS – Frequent Item Feature Selection

In this paper, we propose an informative marker selection method called FIFS (Frequent Item Feature Selection). The notion behind this method is to identify SNPs that characterize certain populations. FIFS is a modular method consisting of two main components. The first, called Population Specific Most Frequent Genotype Identification, is inspired by frequent itemset theory; this module searches the feature space for those SNPs that present almost unique genotypes for a certain population/class. The second, called Informative Marker Subset Construction, is a component that constructs the subsets of SNPs to be returned as the most informative. This final step is implemented in four different ways. Figure 2 depicts the whole feature selection process.
Figure 2. The Data Mining Method FIFS

6.1 Population Specific Most Frequent Genotype Identification
The first part of the algorithm aims to search the feature space and retain only those SNPs
372
that contain a frequent value / genotype for only one population. This is achieved using the
373
following steps:
374
3.1.1 Monomorphic SNP Elimination
375
It is very common for SNP datasets to contain monomorphic loci i.e. loci where all
376
individuals throughout the populations have the same genotype. In other words, these
377
features have the same value for every instance in the dataset. This step ensures that all
378
monomorphic SNPs are eliminated from the analysis, since they exhibit no discriminative
379
power among populations.
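This elimination step can be sketched as follows (an illustrative Python fragment, not the authors' implementation; the dataset is assumed to be held as a mapping from SNP name to the genotype of each individual, and the SNP names are made up):

```python
def drop_monomorphic(genotypes):
    """Remove SNPs for which every individual carries the same genotype.

    genotypes: dict mapping SNP name -> list of genotype strings, one per individual.
    """
    return {snp: values for snp, values in genotypes.items() if len(set(values)) > 1}

# SNP2 is monomorphic (all individuals are "CC"), so it is eliminated.
data = {"SNP1": ["AA", "AT", "TT"], "SNP2": ["CC", "CC", "CC"]}
print(sorted(drop_monomorphic(data)))  # ['SNP1']
```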
6.1.2 Genetic to Transactional Data Conversion

The second step of the algorithm aims to convert the genetic data to transactional data. More precisely, all nominal attributes are converted to binary, so that each initial feature is represented in the transactional database by as many features as the number of the initial feature's values. For instance (Figure 3), SNP1 with occurring values {AA, AT, TT} is transformed to SNP1_AA, SNP1_AT and SNP1_TT, each of them taking values in {1, 0} or {true, false}, which encode the presence or absence of the corresponding SNP value in the examined instance. In this particular example, the individual's genotype at SNP1 is AT in the genetic dataset. In the transactional database, SNP1 will be converted to the values "0" for SNP1_AA, "1" for SNP1_AT and "0" for SNP1_TT. In other words, the transactional data are no longer the SNPs, but the SNP genotypes' presence/absence values.

Figure 3. Conversion of genetic data to transactional dataset.
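The conversion illustrated in Figure 3 can be sketched as follows (an illustrative Python fragment, not the authors' implementation; one binary feature is created per genotype observed at each SNP):

```python
def to_transactional(genotypes):
    """Convert nominal SNP features to binary presence/absence features.

    genotypes: dict mapping SNP name -> list of genotype strings, one per individual.
    Returns a dict mapping binary feature names such as "SNP1_AT" -> list of 0/1 flags.
    """
    transactional = {}
    for snp, values in genotypes.items():
        for genotype in sorted(set(values)):  # one binary feature per occurring genotype
            transactional[snp + "_" + genotype] = [int(v == genotype) for v in values]
    return transactional

# The first individual's genotype at SNP1 is AT, so SNP1_AT is 1 and the rest are 0.
t = to_transactional({"SNP1": ["AT", "AA", "TT"]})
print(t["SNP1_AT"])  # [1, 0, 0]
```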
6.1.3 Split Data in Single-Population Datasets

As mentioned before, the purpose of the "Population Specific Most Frequent Genotype Identification" part is to find the frequent genotypes in each population. In this step, the algorithm splits the newly created transactional dataset into N different datasets, where N is the number of populations of origin. Each newly created dataset contains the individuals/instances of only one population.
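This split can be sketched as follows (an illustrative Python fragment; the population labels are assumed to be available as a list parallel to the instances, and the breed codes are taken from Table 1):

```python
from collections import defaultdict

def split_by_population(instances, labels):
    """Group transactional instances into one dataset per population of origin."""
    per_population = defaultdict(list)
    for record, population in zip(instances, labels):
        per_population[population].append(record)
    return dict(per_population)

rows = [{"SNP1_AT": 1}, {"SNP1_AA": 1}, {"SNP1_AT": 1}]
print(sorted(split_by_population(rows, ["DU", "PI", "DU"])))  # ['DU', 'PI']
```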
6.1.4 Frequent Genotype Selection

In this step, the algorithm calculates the support of each item/feature in each dataset. For each dataset, which corresponds to a specific population, the algorithm creates a set containing all σ-frequent items, i.e. those features with support greater than the user-defined threshold σ.
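The σ-frequent selection can be sketched as follows (an illustrative Python fragment, not the authors' implementation; support is taken here as the fraction of a population's instances in which the genotype is present, following the usual frequent-itemset definition):

```python
def sigma_frequent(transactional, sigma):
    """Return the features whose support in ONE population exceeds sigma.

    transactional: dict mapping binary feature name -> list of 0/1 flags.
    """
    frequent = set()
    for feature, flags in transactional.items():
        support = sum(flags) / len(flags)  # fraction of instances carrying the genotype
        if support > sigma:
            frequent.add(feature)
    return frequent

# SNP1_AT has support 0.75 > 0.7; SNP1_AA has support 0.25 and is not selected.
population = {"SNP1_AT": [1, 1, 1, 0], "SNP1_AA": [0, 0, 0, 1]}
print(sorted(sigma_frequent(population, 0.7)))  # ['SNP1_AT']
```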
6.1.5 Discarding Common Elements

It is essential for the algorithm to retain only features that are uniquely most frequent in each population. After the σ-frequent set construction for all populations, the algorithm performs pairwise comparisons between the sets, in order to eliminate all co-occurrences. Finally, every feature appearing in a set exists only in that population's SNP set.
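The pairwise elimination can be sketched as follows (an illustrative Python fragment; a feature survives only if it is σ-frequent in exactly one population):

```python
def discard_common(frequent_sets):
    """Keep, for each population, only features frequent in no other population.

    frequent_sets: dict mapping population name -> set of sigma-frequent features.
    """
    unique_sets = {}
    for pop, features in frequent_sets.items():
        others = set().union(*(s for p, s in frequent_sets.items() if p != pop))
        unique_sets[pop] = features - others
    return unique_sets

# SNP2_CC is frequent in both populations and is therefore discarded from both.
sets = {"DU": {"SNP1_AT", "SNP2_CC"}, "PI": {"SNP2_CC", "SNP3_GG"}}
print({p: sorted(f) for p, f in discard_common(sets).items()})
# {'DU': ['SNP1_AT'], 'PI': ['SNP3_GG']}
```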
6.1.6 Mapping Back to the Genetic Dataset

Up to this step the algorithm operates on the converted transactional data. To facilitate interpretation of the results, it is crucial that this module maps the converted features of each population's set back to the original SNP names.

Summarizing, this first module receives as input a SNP dataset and returns N sets containing the σ-frequent genotypes, one for each population.
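The mapping can be sketched as follows (an illustrative Python fragment; it assumes the binary features were named by suffixing the genotype, as in the SNP1_AT example of Figure 3):

```python
def map_back(unique_sets):
    """Recover original SNP names from binary feature names such as "SNP1_AT"."""
    return {pop: {feature.rsplit("_", 1)[0] for feature in features}
            for pop, features in unique_sets.items()}

print(sorted(map_back({"DU": {"SNP1_AT", "SNP7_GG"}})["DU"]))  # ['SNP1', 'SNP7']
```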
6.2 Informative Marker Subset Construction

The output of the first module is N sets of SNPs which have σ-frequent genotypes only for a certain population, where N is the number of populations. The purpose of the second module is to select from those sets a specific number of SNPs for each population. This module is implemented in four different ways (discussed below). In all methods, the algorithm uses an integer number C, which is the number of features used to "describe" each population. The value of C is the same across populations. It is important to mention that although the total number of selected features we expect to retrieve is the product of C and the number of populations, this product represents an upper bound on the final feature count. This is because two binary features, selected from two different populations, may map to the same initial feature (SNP), as they might represent two different values of the same feature (SNP). For example, one population might be characterized by the presence of genotype AA at a specific SNP, whereas another by the presence of genotype TT at the same SNP. The four different ways used to implement the module are as follows:
FIFS_RS: The algorithm performs selection via random sampling without replacement of C SNPs from each population's unique genotype subset.

FIFS_In: This method ranks the SNPs separately within each population's unique genotype subset and generates N ranked lists. The algorithm selects the top-C SNPs from each ranked list/population.

FIFS_RSRN: This is a variation of the FIFS_RS method, inspired by wrapper methods. FIFS_RSRN performs random sampling without replacement X times, and returns X different subsets of informative SNPs. Subsequently, the algorithm calculates the assignment accuracy of each subset and finally retains only the subset with the best performance. For the assignment test we used the following equations [40]:
\[
\frac{2\,(n_{ijk} + 1/k_j)\,(n_{ijk'} + 1/k_j)}{(n_{ij} + 1)(n_{ij} + 2)} \quad \text{if } k \neq k' \qquad (6)
\]

\[
\frac{(n_{ijk} + 1/k_j + 1)\,(n_{ijk} + 1/k_j)}{(n_{ij} + 1)(n_{ij} + 2)} \quad \text{if } k = k' \qquad (7)
\]
where n_ijk is the number of alleles k sampled at locus j in population i (not counting the individual to be assigned), n_ij is the number of gene copies sampled at locus j in population i, i.e. n_ij = Σ_k n_ijk,
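Equations (6) and (7) can be sketched in code as follows (an illustrative Python fragment, not the authors' implementation; it assumes k_j denotes the number of distinct alleles at locus j, following the Bayesian assignment formulation of [40]):

```python
def genotype_probability(counts, n_ij, k_j, a, b):
    """Probability of observing genotype (a, b) at one locus in a population,
    per equations (6) (heterozygote) and (7) (homozygote).

    counts: dict mapping allele -> n_ijk, the count of that allele in the
            population sample (excluding the individual to be assigned).
    n_ij:   total number of gene copies sampled at the locus.
    k_j:    number of distinct alleles at the locus.
    """
    prior = 1.0 / k_j
    denom = (n_ij + 1) * (n_ij + 2)
    n_a, n_b = counts.get(a, 0), counts.get(b, 0)
    if a != b:  # equation (6)
        return 2 * (n_a + prior) * (n_b + prior) / denom
    return (n_a + prior + 1) * (n_a + prior) / denom  # equation (7)

# Locus with alleles A and T sampled 6 and 4 times (n_ij = 10, k_j = 2):
print(round(genotype_probability({"A": 6, "T": 4}, 10, 2, "A", "T"), 4))  # 0.4432
```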