Accepted Manuscript

Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery

Yong-Huan Yun, Bai-Chuan Deng, Dong-Sheng Cao, Wei-Ting Wang, Yi-Zeng Liang

PII: S0003-2670(16)30021-6
DOI: 10.1016/j.aca.2015.12.043
Reference: ACA 234336
To appear in: Analytica Chimica Acta
Received Date: 15 September 2015
Revised Date: 28 December 2015
Accepted Date: 30 December 2015

Please cite this article as: Y.-H. Yun, B.-C. Deng, D.-S. Cao, W.-T. Wang, Y.-Z. Liang, Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery, Analytica Chimica Acta (2016), doi: 10.1016/j.aca.2015.12.043.

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery

Yong-Huan Yun1, Bai-Chuan Deng2, Dong-Sheng Cao3, Wei-Ting Wang1, Yi-Zeng Liang1,*

1 College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, P.R. China
2 College of Animal Science, South China Agricultural University, Guangzhou 510642, P.R. China
3 College of Pharmaceutical Sciences, Central South University, Changsha 410083, P.R. China

* Corresponding author. Tel.: +86-0731-88830831. E-mail address: [email protected] (Y.-Z. Liang)

Abstract

Biomarker discovery is an important goal in metabolomics. It is typically modeled as selecting the most discriminating metabolites for classification and is often referred to as variable importance analysis or variable selection. A number of variable importance analysis methods for discovering biomarkers in metabolomics studies have been proposed. However, different methods are likely to generate different variable ranking results because they rest on different principles: each method produces a variable ranking list just as an expert presents an opinion. This inconsistency between variable ranking methods is often ignored. A simple and ideal solution to this problem is to take every ranking into account. In this study, a strategy called rank aggregation was employed. It is an indispensable tool for merging individual ranking lists into a single "super"-list reflective of the overall preference or importance within the population. This "super"-list is regarded as the final ranking for biomarker discovery. Finally, it was
used for biomarker discovery and for selecting the best variable subset with the highest predictive classification accuracy. Nine methods were used, including three univariate filter methods and six multivariate methods. When applied to two metabolomics datasets (a childhood overweight dataset and a tubulointerstitial lesions dataset), rank aggregation achieved markedly higher prediction accuracy than using all variables. It also outperformed the penalized method, least absolute shrinkage and selection operator (LASSO), giving higher prediction accuracy or fewer, more interpretable selected variables.

Key words: Variable importance; Variable ranking; Biomarker discovery; Rank aggregation; Metabolomics
1. Introduction

Metabolomics is an emerging field which combines strategies to identify and quantify cellular metabolites present in organisms, cells, or tissues using advanced analytical techniques together with statistical and multivariate methods. One target of metabolomics research is to discover biomarkers, which can aid in the diagnosis of many diseases of the metabolic system and provide biological and clinical guidance. The discovery of biomarkers is typically modeled as selecting the most discriminating variables (metabolites) for classification (e.g., discriminating diseased versus healthy) [1], which is often referred to as variable importance analysis or variable selection in the language of statistics and machine learning. Any variable importance analysis method can be turned into variable selection by introducing a threshold on the variable importance values. In the past two decades, a large number of papers on biomarker discovery in metabolomics have employed statistical methods [2-4], including univariate filter methods and multivariate methods (principal component analysis (PCA), partial least squares-linear discriminant analysis (PLS-DA) [5], support vector machine (SVM) [6], random forest (RForest) [7], and penalized methods such as the least absolute shrinkage and selection operator (LASSO) [8] and the elastic net [9]). Ranking of variable (metabolite) importance is often carried out with the help of such statistical methods: the ranking assigns a measure of importance to each variable, and a subset of all metabolites can then be identified by setting a threshold value. Variable importance analysis methods consist of model-free and model-based approaches. One group of the model-free approaches is based on simple
univariate test statistics without using model information, such as the t test, fold change, and the Wilcoxon rank-sum test. These reflect whether the difference between the healthy and diseased groups' averages is statistically significant. The other group is based on features of the relationship between the variables and the classification label, such as the correlation coefficient, information gain, Euclidean distance, and mutual information. Model-based approaches, by contrast, are tied to model performance and include model-fitting and model-prediction approaches. The model-fitting approach derives importance measures from the fitted model itself, such as partial least squares (PLS) weights [10], PLS loadings [10], regression coefficients (RC) [11], variable importance in projection (VIP) [12], and the selectivity ratio (SR) [13-15]. The model-prediction approach is based on the model's prediction performance, assessed by means of resampling methods such as the jackknife, bootstrap, and cross validation. Variable importance analysis based on random variable combination (VIAVC) [16], subwindow permutation analysis (SPA) [17], uninformative variable elimination (UVE) [18], random frog (RFrog) [19], margin influence analysis (MIA) [20], RForest, and LASSO belong to the model-prediction approach. To date, researchers have often applied many different variable importance analysis methods to get as much as possible out of their data, presented only the most favorable results, and then interpreted the biomarkers that show good performance with high classification accuracy. However, many different subsets of variables from different methods may achieve the same or similar predictive accuracy. Different variable importance analysis methods are likely to generate
different variable rankings due to their different principles, even among the top-ranking variables. For instance, method "A" may identify the top three variables as potential biomarkers, whereas these variables are not ranked at the top by method "B", yet both methods discover variable subsets with the same classification accuracy. Consequently, one cannot assess which method is accurate and feasible. Moreover, inconsistency between variable rankings is often ignored in metabolomics research when presenting a new data set, because it would make the interpretation of results more confusing and raise doubts about the reliability of the data. As Franklin D. Roosevelt once said, "there are as many opinions as there are experts". Each method generates a variable ranking list just as an expert presents an opinion. The multiplicity of methods and the question of how to deal with their different ranking results are very general issues in the analysis of variable importance. To address this problem, a simple and ideal solution is to take every ranking into account.
In this study, a strategy that can combine the ranking results of all methods, called rank aggregation and first proposed by Pihur et al. [21], was introduced. It is an indispensable tool for merging individual ranking lists into a single "super"-list reflective of the overall preference or importance within the population. In other words, it aims to find a "super"-list that is as "close" as possible to all individual ranking lists simultaneously. This "super"-list is regarded as the final ranking for biomarker discovery. In this work, we employed nine variable importance analysis methods from the model-free and model-based approaches, including the t test, Wilcoxon rank-sum test,
Relief, PLS-RC, PLS-VIP, SPA, RFrog, MIA, and RForest. All nine methods were applied to the two metabolomics datasets to generate rankings of variable importance. Rank aggregation was then used to ensemble all the variable rankings and produce a "super" ranking list. Finally, this ranking list was used to select the best variable subset with the highest predictive classification accuracy.
2. Methods and theory

2.1 Variable importance analysis methods

As outlined in the introduction, variable importance analysis methods can be separated into two groups: model-free and model-based approaches. In this section, the methods we used for rank aggregation are introduced in brief.

2.1.1 Model-free approaches

Model-free approaches comprise two different strategies. One is based on univariate test statistics, such as the t test and the Wilcoxon rank-sum test. The other is based on statistical features of the relationship between the variables and the classification label, such as the correlation coefficient, information gain, Euclidean distance, and mutual information.

T test statistic

The statistical significance of a t test indicates whether or not the difference between the healthy and diseased groups' averages is real. The smaller the P-value of the t test, the more significant the difference between the two groups. Each variable of the data X receives a P-value when the difference between the healthy and diseased groups' averages is tested, and all variables can then be ranked by P-value in ascending order.

Wilcoxon rank-sum test
The Wilcoxon rank-sum test is a nonparametric test based solely on the order in which the samples from the two groups appear. As with the t test, the smaller the P-value, the more significant the difference between the two groups, so all variables can be ranked by P-value in ascending order. Since it compares rank sums, the Wilcoxon rank-sum test is more robust than the t test: it is less likely to produce spurious results in the presence of outliers.
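As an illustration, both univariate rankings take only a few lines of code. The sketch below (Python with SciPy, ours rather than the authors' MATLAB/R code) ranks the columns of a data matrix X by ascending P-value:

```python
import numpy as np
from scipy import stats

def univariate_rankings(X, y):
    """Rank the columns of X by t-test and Wilcoxon P-values (ascending)."""
    a, b = X[y == 0], X[y == 1]
    k = X.shape[1]
    p_t = np.array([stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(k)])
    p_w = np.array([stats.ranksums(a[:, j], b[:, j]).pvalue for j in range(k)])
    # argsort of ascending P-values lists variables from most to least significant
    return np.argsort(p_t), np.argsort(p_w)
```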
Relief

The Relief algorithm [22, 23] is a correlation-based method that assigns a "relevance" weight to each variable, which is meant to reflect the relevance of the variable to the target concept. The core idea of Relief is to score variables based on the difference between a selected sample and its two nearest samples of the same and of the opposite class (the "near-hit" and "near-miss"). In effect, Relief attempts to approximate the following difference of probabilities for the weight of a variable A, WA:

WA = P(different value of A | nearest sample from a different class) − P(different value of A | nearest sample from the same class)

The principle is that a good variable should distinguish between samples from different classes and should have the same value for samples from the same class. The output of Relief is a ranking of the variables by WA: the larger WA, the higher the place the corresponding variable obtains.
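A minimal Relief sketch (illustrative Python, ours; L1 nearest neighbours on min-max-scaled data and one pass over all samples, rather than the exact algorithm of [22, 23]):

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief: reward separation from the near-miss, penalize spread to the near-hit."""
    n, k = X.shape
    # scale each variable to [0, 1] so per-variable differences are comparable
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    w = np.zeros(k)
    for i in range(n):
        d = np.abs(Xs - Xs[i]).sum(axis=1)  # L1 distances to sample i
        d[i] = np.inf                       # never pick the sample itself
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(d[same])]      # nearest sample of the same class
        miss = diff[np.argmin(d[diff])]     # nearest sample of the opposite class
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / n
```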
2.1.2 Model-based approaches

Model-based approaches are related to model performance, including fitting and prediction performance. The model-fitting approach is commonly used in PLS-DA modeling: the constructed PLS-DA model is fitted and then yields a series of quantities, such as weights, loadings, RC, VIP, and SR. Here, we briefly introduce the selected model-fitting methods, RC and VIP.

PLS-DA is derived from PLS regression [6] and involves forming a regression model between X and y. Given a sample matrix X with dimensions n×k, containing n samples in rows and k variables in columns, the corresponding class label y of size n×1 holds 1 or −1 for the binary classification case, and the model is factorized with A components.
There are many versions of the PLS algorithm; for simplicity, we focus here only on the single-response case, called PLS1 or simply PLS, with orthogonal scores. Assume that A (A ≤ k) equals the number of relevant components for prediction, based on the definition of Naes and Helland [24].

1. Calculate the PLS weight vector:

w = X'y    (1)

2. Calculate the scores:

t = Xw / ∑w²    (2)

3. Calculate the X loadings:

p = t'X / ∑t²    (3)

4. Calculate the y loading:

q = y't / ∑t²    (4)

5. Deflate X and y by subtracting the contribution of t to obtain the residual data matrices:

Xresidual = X − tp    (5)

yresidual = y − tq    (6)

6. For PLS components a = 1, 2, …, A, the algorithm runs from step 1 to step 5, replacing X and y by their residuals.

Once the PLS model is built, it is possible to predict the value of y for the original data and test-set samples (known origins), or for future samples (unknown origins), as follows. The relationship between X and y can be expressed by

y = Xβ + f = Tq + f    (7)

where β is a regression coefficient vector with dimensions k×1; the y value of an unknown sample x can then be predicted by

ŷ = xβ    (8)

The estimate of β is obtained as

β = w(pw)⁻¹q    (9)

where, for several components, w and p stack the weight and loading vectors of all A components.
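Steps 1-6 and Eq. (9) translate directly into code. The following illustrative NumPy sketch is ours (the study itself used MATLAB); normalizing w is our convention and does not change the fitted model:

```python
import numpy as np

def pls1(X, y, A):
    """Orthogonal-score PLS1 following steps 1-6; X is n×k, y is an n-vector."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    n, k = X.shape
    W, P, T, q = np.zeros((k, A)), np.zeros((k, A)), np.zeros((n, A)), np.zeros(A)
    for a in range(A):
        w = X.T @ y                      # (1) weight vector
        w /= np.linalg.norm(w)           # normalize so that w'w = 1
        t = X @ w                        # (2) scores
        p = (X.T @ t) / (t @ t)          # (3) X loadings
        qa = (y @ t) / (t @ t)           # (4) y loading
        X -= np.outer(t, p)              # (5) deflate X
        y -= t * qa                      #     deflate y
        W[:, a], P[:, a], T[:, a], q[a] = w, p, t, qa
    beta = W @ np.linalg.solve(P.T @ W, q)   # (9) regression vector
    return W, P, T, q, beta
```

With A equal to the number of variables on noiseless linear data, the recovered β reproduces y exactly, and the score vectors are mutually orthogonal, two standard sanity checks for this algorithm.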
Regression coefficients (RC, β)

The vector of RC (β) gives a single measure of the relationship between each variable and the corresponding response (class label). Variables having small absolute values of this measure can be eliminated as uninformative or interfering variables [18, 25-27]. Furthermore, instead of relying on a single observed value of β, its distribution or variation can also be taken into consideration: resampling methods such as bootstrapping and Monte Carlo sampling may give more efficient estimates of the regression coefficients, at the expense of extra computation time [18]. With this measure, the larger the absolute value of a variable's regression coefficient, the more important the variable.
Variable importance in the projection (VIP)

VIP, as termed by [28], accumulates the importance of each variable j as reflected by the weight w of each component a. The VIP measure is defined as

VIPj = sqrt( k ∑a=1..A [ qa² (ta'ta) (waj / ‖wa‖)² ] / ∑a=1..A [ qa² (ta'ta) ] )    (10)

that is, the variance explained by each component, qa² ta'ta, is weighted by the contribution (waj / ‖wa‖)² of variable j to that component and summed over all A components. Furthermore, variable j can be eliminated if VIPj < u for a user-defined threshold u ∈ [0, ∞). It is generally accepted that a variable should be selected if its VIP score is greater than 1, which is typically used as the criterion for important variable selection [5, 29, 30]. In this study, the variables were sorted in descending order of their VIP scores.
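Given the W, T, and q produced by a PLS fit, Eq. (10) is nearly a one-liner; a Python sketch (our illustration, with k variables and A components):

```python
import numpy as np

def vip_scores(W, T, q):
    """VIP of Eq. (10): W is the k×A weight matrix, T the n×A scores, q the A y-loadings."""
    k, A = W.shape
    ssy = (q ** 2) * np.einsum('na,na->a', T, T)   # q_a^2 * t_a' t_a per component
    wnorm2 = (W / np.linalg.norm(W, axis=0)) ** 2  # (w_aj / ||w_a||)^2
    return np.sqrt(k * (wnorm2 @ ssy) / ssy.sum())
```

A useful property for checking an implementation: the squared VIP scores always sum to k, since each column of normalized squared weights sums to one.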
As for the model-prediction approach, it is based on the model's prediction performance, assessed by means of resampling methods.
Subwindow permutation analysis (SPA)

SPA [17], coupled with PLS-DA, investigates the importance of each variable while implicitly taking the synergistic effect of the remaining variables into account. It assesses the importance of every variable by comparing the distributions of normal prediction errors (NPEs) and permuted prediction errors (PPEs) of the same variable, using the strategy of model population analysis (MPA) [16, 31-36]. MPA aims at extracting statistical information from a large 'population' of sub-models built on different sub-datasets generated from the original dataset by Monte Carlo (MC) sampling. The use of variable subsets makes SPA more efficient and faster on high-dimensional datasets. For each variable, SPA finally outputs a conditional P-value computed from the two distributions of NPEs and PPEs. The smaller the P-value, the more important the corresponding variable.
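The NPE/PPE comparison at the core of SPA can be sketched as follows. This is a deliberate simplification, not the algorithm of [17]: a nearest-centroid classifier stands in for the PLS-DA sub-models, the rank-sum test stands in for the error-distribution comparison, and the function names are ours:

```python
import numpy as np
from scipy.stats import ranksums

def spa_pvalues(X, y, n_models=100, frac=0.7, seed=0):
    """MPA-style sketch: compare normal vs. permuted prediction errors per variable."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    npe = [[] for _ in range(k)]
    ppe = [[] for _ in range(k)]
    for _ in range(n_models):
        train = rng.choice(n, int(frac * n), replace=False)
        test = np.setdiff1d(np.arange(n), train)
        mu0 = X[train][y[train] == 0].mean(axis=0)  # class centroids stand in
        mu1 = X[train][y[train] == 1].mean(axis=0)  # for the PLS-DA sub-model
        def error(Xt):
            pred = np.abs(Xt - mu1).sum(axis=1) < np.abs(Xt - mu0).sum(axis=1)
            return np.mean(pred.astype(int) != y[test])
        e_normal = error(X[test])
        for j in range(k):
            Xp = X[test].copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # destroy variable j's information
            npe[j].append(e_normal)
            ppe[j].append(error(Xp))
    # one-sided test: permuting an important variable should inflate the error
    return np.array([ranksums(ppe[j], npe[j], alternative='greater').pvalue
                     for j in range(k)])
```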
Random frog (RFrog)

RFrog applies a reversible jump Markov chain Monte Carlo (RJMCMC)-like algorithm [37] to assess the importance of each variable. It searches the model space through both fixed- and trans-dimensional moves between different models generated by MC sampling; a pseudo-MCMC chain is then computed and used to derive a selection probability, which can be regarded as a criterion for the relevance of each variable. Variables can then be ranked in terms of their selection frequency over the population of obtained variable subsets: the most frequently selected variable is the most important one.
Margin Influence Analysis (MIA)

MIA [20] is based on MPA and designed to work with support vector machines (SVM) for variable selection. It reveals statistically significant variables based on the fact that the margin of an SVM plays an important role in the generalization performance of SVM models. First, MC sampling is employed to generate a population of sub-datasets (different variable combinations) in the variable space; a sub-model is then built on each sub-dataset using SVM. Finally, the importance of each variable is assessed by comparing two groups of margin distributions over all constructed SVM classifiers (those with and those without the specific variable in a large variable combination). The nonparametric Mann-Whitney U test is applied to the margin distributions to check whether the increase in margin is significant, yielding a P-value for each variable. The smallest P-values indicate the most informative variables.
Random Forest (RForest)

RForest, developed by Breiman [7], is a classification algorithm that ensembles unpruned decision trees, each constructed via a tree classification algorithm such as classification and regression trees (CART) [38]. Each tree is grown on an independently drawn bootstrap sample and a randomly selected small group of variables from the training data. Once formed, the forest assigns the class that receives the majority vote. Variable importance in RForest is measured by the increase in prediction error when the values of that variable are permuted across the out-of-bag (OOB) observations. For each variable, the classification accuracies of the OOB samples with and without permutation of its values are computed; the differences are then averaged over the entire ensemble and divided by their standard deviation over the ensemble to produce the importance of each variable.
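The permutation-importance idea can be approximated with scikit-learn. The sketch below is illustrative only: it permutes variables on a held-out half instead of the per-tree OOB samples, and it omits the standard-deviation scaling described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_permutation_importance(X, y, seed=0):
    """Accuracy drop on held-out data when each variable is permuted."""
    rng = np.random.default_rng(seed)
    half = len(y) // 2
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X[:half], y[:half])                # train on one half
    base = rf.score(X[half:], y[half:])       # accuracy on the other half
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X[half:].copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # destroy variable j's information
        importance[j] = base - rf.score(Xp, y[half:])
    return importance
```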
2.2 Rank aggregation

Rank aggregation [21] techniques aim to provide a general framework for performing this essential aggregation objectively. With the rapid growth of metabolomics studies, the
potential application of rank aggregation becomes even more necessary, to solve the problem of inconsistency between different variable ranking methods. When assessing variable importance, the methods mentioned above give different variable ranking results. One of the major strengths of rank-based aggregation is its ability to merge all ranking lists generated by the different algorithms.

Given the different variable ranking lists, an objective function must first be defined so that rank aggregation can be performed in the framework of an optimization problem. In this sense, we intend to find a "super"-list that is as "close" as possible to all individual variable ranking lists simultaneously. Thus, the rank aggregation problem can be turned into a minimization problem over all possible orderings of the variables, as follows:
φ(δ) = ∑i=1..m wi d(δ, Li)    (18)

where δ is a proposed ranking list of length k = |Li|, wi is the importance weight associated with list Li, d is a distance function, and Li is the i-th ranking list. The idea is to find the δ* that minimizes the total distance between δ* and all the Li:

δ* = arg minδ ∑i=1..m wi d(δ, Li)    (19)

In this work, one of the most popular distance functions, the Spearman footrule distance, is used to measure the distance between ordered lists [21]. The Spearman footrule distance between Li and any ranking list δ is defined as

S(δ, Li) = ∑t∈Li | rδ(t) − rLi(t) |    (20)

where rLi(t) is the rank of element t in the list Li (1 means the "best") if t is within the top k, and equals k + 1 otherwise; rδ(t) is defined in the same way. The distance is nothing more than the sum of the absolute differences between the ranks of all individual elements in the two ranking lists, a very intuitive metric for comparing two lists: the smaller its value, the more similar the ranking lists.
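Equations (18) and (20) translate directly into code. A small illustrative Python sketch (ours) of the footrule distance and the weighted objective φ(δ); for simplicity it sums over the elements of each top-k list, as in Eq. (20):

```python
def footrule(delta, L, k):
    """Spearman footrule S(delta, L) of Eq. (20); ranks beyond the top k count as k + 1."""
    rank_delta = {v: r + 1 for r, v in enumerate(delta[:k])}
    return sum(abs(rank_delta.get(t, k + 1) - (r + 1)) for r, t in enumerate(L[:k]))

def phi(delta, lists, weights=None):
    """Weighted objective of Eq. (18); the aggregated "super"-list minimizes this."""
    k = len(delta)
    weights = weights if weights is not None else [1.0] * len(lists)
    return sum(w * footrule(delta, L, k) for w, L in zip(weights, lists))
```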
The cross-entropy (CE) Monte Carlo algorithm [39] is used as the minimization algorithm to find the optimal variable ranking list. Details of the CE algorithm can be found in [40]. Briefly, CE is a stochastic search over 0-1 matrices whose columns sum to one and whose rows sum to at most one, since any ranking list can be uniquely mapped to such a matrix.
2.3 The selection of the optimal variable subset

This step discovers the potential biomarkers. The optimal variable subset is selected according to the aggregated ranking list of all the selected variable importance analysis methods. Variables from the "super"-list are sequentially added to build PLS-DA models whose performance is assessed using 10-fold double cross validation (DCV) [41]. The area under the receiver operating characteristic curve (AUC) from the 10-fold DCV of PLS-DA is used to assess each variable subset, and the variable whose addition yields the maximum AUC value defines the threshold. For example, if the fifth variable on the ranking list gives the maximum AUC value, the top five variables form the optimal variable subset, with the best 10-fold DCV AUC value as its performance.
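The sequential-addition search can be sketched as follows (illustrative Python; the `score` callback is a placeholder for the 10-fold DCV AUC of a PLS-DA model, which is not implemented here):

```python
def best_subset(super_list, score):
    """Add variables in aggregated-rank order; keep the prefix with the best score."""
    best_m, best_val = 1, float('-inf')
    for m in range(1, len(super_list) + 1):
        val = score(super_list[:m])  # e.g. 10-fold DCV AUC of a PLS-DA model
        if val > best_val:
            best_m, best_val = m, val
    return super_list[:best_m], best_val
```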
3. Software

All calculations were performed in MATLAB (version 2013a, The MathWorks, Inc.) and R (version 3.1.2, i386) on a general-purpose computer with an Intel Core i5 3.2 GHz CPU and 3 GB of RAM running Microsoft Windows XP. Our in-house MATLAB code (http://www.libpls.net) was used, together with the R package RankAggreg.
4. Results and discussion

The maximum number of latent variables was set to 5 in the cross validation of PLS-DA. All data were first autoscaled to zero mean and unit variance before modeling, except for the random forest model.
4.1 Childhood overweight data

This dataset was obtained by GC-MS analysis of blood plasma samples collected from 16 children of normal weight and 13 overweight children at the Xiangya Hospital of Central South University in Changsha, China. The clinical experiments were approved by the Xiangya Institutional Human Subjects Committee. All samples were analyzed on a Shimadzu GC-MS-QP2010 gas chromatography-mass spectrometry instrument (Shimadzu, Kyoto, Japan), and thirty metabolites were identified and accurately quantified using an internal standard. More details about this dataset are given in [42].
The nine variable importance methods were applied to this dataset; the results are shown in Fig. 1. As can be seen from Fig. 1, different methods generate different results, which makes the selection confusing. Thus, rank aggregation is needed to integrate the different results and provide a consensual variable ranking list. This ranking list is reliable and is used to select the best variable subset with the highest predictive classification accuracy.

(Insert Figure 1)
To reveal the best variable subset, 10-fold DCV of PLS-DA was conducted by adding the top variables sequentially according to the aggregated rank. The results of the 10-fold DCV of PLS-DA for each subset are displayed in Table 1 and Fig. 2a. The top 3 metabolites (variables NO. 12, 23, and 5) form the best subset, with the maximum AUC value of 0.8534 and the highest prediction accuracy of 0.8621. Rank aggregation aims to merge individual ranking lists into a single "super"-list reflective of the overall preference or importance within the population; accordingly, these 3 metabolites were consistently given high ranking scores. Variable NO. 12 appears seven times in first place, once in third place, and once in fourth place, and thus takes first place in the aggregated ranking list. Variable NO. 23 appears five times in second place, once in third place, once in fourth place, and twice in seventh place. Although variable NO. 23 ranks higher than NO. 12 in the Wilcoxon rank-sum test, it cannot beat NO. 12, which obtains first place from seven methods. Thus, assessing the importance of each variable should draw on the opinion of every method; it is unreliable to assess importance relying on the result of only one method. As another example, variable NO. 20 acquires first place from the Wilcoxon rank-sum test but receives fluctuating ranks from the other methods (4th, 8th, 9th, 10th, 12th, 16th, and 23rd places), leading to 9th place in the aggregated ranking list. Therefore, it is necessary to perform rank aggregation to integrate the different ranking results and produce a consensual and reliable result.
To allow a comparison with biomarker discovery by rank aggregation, the penalized method LASSO was used. LASSO is a shrinkage and variable selection method that minimizes the residual sum of squared errors subject to the sum of the absolute values of the coefficients being less than a constant. This constraint enables LASSO to perform variable selection, because the shrinkage can drive some coefficients exactly to zero. As a result, it reduces the estimation variance while retaining the informative variables, yielding an interpretable final model. The variables corresponding to zero coefficients are uninformative. The variables selected by LASSO, those with non-zero coefficients, are shown in Table 1. The variables selected by rank aggregation are included among those selected by LASSO. However, rank aggregation performs better than LASSO, with higher prediction accuracy and fewer variables, which indicates that rank aggregation is more reliable and that its selection of variables is consensual and more interpretable.

(Insert Table 1)

(Insert Figure 2)
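For reference, LASSO-style selection of the non-zero-coefficient variables can be sketched with scikit-learn (an illustration only; `alpha` below is an arbitrary regularization strength, not the value tuned in this study):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_selected(X, y, alpha=0.1):
    """Fit LASSO and return the indices of variables with non-zero coefficients."""
    model = Lasso(alpha=alpha).fit(X, y)
    return np.flatnonzero(model.coef_)
```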
As the top 3 variables are finally selected as the best subset, the number of overlaps between the top 3 variables of the aggregated list and those of each method can be used as a simple assessment of each method's performance. The top 3 variables of SPA, RFrog, and MIA are exactly the same as those of the aggregated list. VIP performs the worst, with no overlapping variable among its top 3. For this dataset, model-prediction approaches are better than model-fitting and model-free approaches.
The selected metabolites, variables NO. 12, 23, and 5, are β-hydroxybutyric acid, glyceric acid, and palmitic acid (PA), respectively. As reported in the study of Proenza et al. [43], the level of β-hydroxybutyrate in overweight men is higher than in lean men, which indicates a close relationship between β-hydroxybutyrate and overweight diagnosis and confirms that this metabolite is correctly identified as an important variable. Glyceric acid can be derived from the oxidation of glycerol, and its phosphate derivatives, including 2-phosphoglyceric acid, 3-phosphoglyceric acid, 2,3-bisphosphoglyceric acid, and 1,3-bisphosphoglyceric acid, are very important biochemical intermediates of lipid metabolism involved in the development of overweight [44]. In addition, according to the work of Kien et al. [45], increased dietary PA can reduce fat oxidation and may increase the risk of obesity and insulin resistance, suggesting that PA plays an important role in obesity risk. Thus, PA is directly related to overweight diagnosis.
4.2 Tubulointerstitial lesions data (TLS)

The TLS data were obtained by GC-MS analysis of urinary samples collected from 25 patients with mild tubulointerstitial lesions and 25 patients with severe tubulointerstitial lesions at the Department of Nephrology of the University Hospital of Ioannina, each sample comprising 200 bins. All study participants gave informed consent for the investigation, which was approved by the Ethical Committee of the University Hospital of Ioannina. A more detailed description is given in reference [46].
The TLS data have 200 variables, most of which are uninformative; rank aggregation was therefore used to generate only the top 30 aggregated variables in this work. The variable rankings produced by the nine variable importance analysis methods and their aggregation are shown in Fig. 3. It is visually clear that different methods generate different results. When rank aggregation was performed to integrate them, it provided a consensus variable ranking list that takes the opinion of every one of the nine methods into account; the merged list is based on the rank score of each method. In Fig. 3, variable No. 142 takes first place in the aggregated list because most of the methods give it a high rank score. Variable No. 182 is ranked first by two methods (Relief and RFrog) and variable No. 138 is likewise ranked first by two methods (t-test and VIP), but No. 182 obtains more fluctuating results than No. 138; No. 138 therefore ranks higher than No. 182. This demonstrates again that the importance of each variable should be assessed from the results of all methods rather than by relying on any single one.
The aggregated ranking list was then used to select the variable subset with the highest predictive classification accuracy for biomarker discovery. 10-fold DCV of PLS-DA was conducted by adding the top-ranked variables sequentially according to the aggregated list; the results for each subset are displayed in Fig. 2b and Table 2. The top 3 metabolites (variables No. 142, 138 and 182) were selected as the best subset, with the maximum AUC of 0.96 and the highest prediction accuracy of 1. Compared with LASSO, rank aggregation selects fewer variables while achieving the same prediction performance (Table 2), which shows that different variable subsets from different methods can achieve the same predictive performance; the variables selected by rank aggregation are also among those selected by LASSO. In the original study, these three variables belong to two biomarkers, citrate and proteins, which were found at relatively higher levels in patients. A previous study [42] reported that the initiation of tubulointerstitial lesions is characterized by decreased excretion of citrate and that proteinuria often characterizes renal patients. The two metabolites are therefore potential biomarkers that can distinguish patients with mild tubulointerstitial lesions from those with severe lesions, which indicates that the variables selected by rank aggregation are more interpretable than those selected by LASSO.
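The idea of merging ranked lists can be sketched with a simple mean-rank (Borda-count) scheme. The paper itself uses the weighted cross-entropy Monte Carlo optimization of the RankAggreg R package [21], so this is only an illustrative stand-in; the toy variable numbers loosely echo those discussed above.

```python
# A minimal mean-rank (Borda-count) sketch of rank aggregation. It only
# illustrates the idea that variables ranked highly by most methods rise
# toward the top of the merged list; it is not the cross-entropy Monte
# Carlo optimization used in the study.
from collections import defaultdict

def aggregate_rankings(ranking_lists):
    """Merge several best-first ranked lists into one consensus list."""
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for ranking in ranking_lists:
        for position, var in enumerate(ranking):
            rank_sums[var] += position     # position 0 = top of the list
            counts[var] += 1
    mean_rank = {v: rank_sums[v] / counts[v] for v in rank_sums}
    return sorted(mean_rank, key=mean_rank.get)   # lower mean rank first

# Toy lists from three "methods" ranking four variables:
lists = [[142, 138, 182, 7],
         [182, 142, 138, 7],
         [138, 142, 7, 182]]
print(aggregate_rankings(lists))  # → [142, 138, 182, 7]
```

The cross-entropy Monte Carlo approach instead searches for the list minimizing a distance (e.g. a weighted Spearman footrule) to all individual lists, which makes the aggregation more robust to a single outlying method.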
(Insert Table 2)
Counting the overlap between the top 3 variables of the aggregated list and those of each method, only the model-free approaches, the t-test and Relief, select exactly the same variables as the aggregated list; the other methods share two variables with it. For this data set, the model-free approaches are therefore better than the model-fitting and model-prediction approaches.
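The overlap count is simply the size of the intersection of top-3 sets. In the snippet below, only the t-test and Relief sets are given in the text; the "VIP" set is hypothetical, chosen to show a partial overlap.

```python
# Illustrative overlap count between the aggregated top-3 list and each
# method's own top 3. The "VIP" entry is hypothetical.
aggregated_top3 = {142, 138, 182}
method_top3 = {
    "t test": {142, 138, 182},   # stated in the text to match exactly
    "Relief": {142, 138, 182},   # stated in the text to match exactly
    "VIP":    {138, 142, 33},    # hypothetical: two variables in common
}
overlap = {m: len(s & aggregated_top3) for m, s in method_top3.items()}
print(overlap)  # → {'t test': 3, 'Relief': 3, 'VIP': 2}
```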
5. Conclusion
In this study, rank aggregation was employed to merge the ranking results of nine variable importance analysis methods into a single "super"-list reflecting the overall preference, or importance, within the population, and biomarker discovery was then conducted on this list. The results show that good performance can be obtained using the metabolites revealed by rank aggregation compared with using all variables or the LASSO method. Since each method generates a variable ranking list much as an expert offers an opinion, rank aggregation is needed to take all of the opinions into account and produce a reliable variable ranking list. Comparing the overlap between the top variables of the model-free, model-fitting and model-prediction approaches indicates that no single method performs best all the time: all of the methods are data dependent. Rank aggregation is thus a good way to produce a reliable variable ranking for biomarker discovery, and it can address, to some extent, the inconsistency between different variable ranking methods. It should be noted that when employing rank aggregation, more diverse variable importance analysis methods should be introduced to produce a more consensual and reliable ranking.
Acknowledgements
This work is financially supported by the National Natural Science Foundation of China (grants no. 21275164, 21465016, 21175157 and 21375151) and by the Fundamental Research Funds for the Central Universities of Central South University (grant no. 2014zzts014). The studies meet with the approval of the university's review board.
References
[1] M. Hilario, A. Kalousis, Approaches to dimensionality reduction in proteomic biomarker studies, Brief. Bioinform., 9 (2008) 102.
[2] M. Dash, H. Liu, Feature selection for classification, Intell. Data Anal., 1 (1997) 131.
[3] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res., 3 (2003) 1157.
[4] Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics, 23 (2007) 2507.
[5] T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in Partial Least Squares Regression, Chemometr. Intell. Lab. Syst., 118 (2012) 62.
[6] N. Zavaljevski, F.J. Stevens, J. Reifman, Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions, Bioinformatics, 18 (2002) 689.
[7] L. Breiman, Random forests, Mach. Learn., 45 (2001) 5.
[8] R. Tibshirani, Regression shrinkage and selection via the Lasso, J. Roy. Stat. Soc. B, 58 (1996) 267.
[9] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Stat. Soc. B, 67 (2005) 301.
[10] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemometr. Intell. Lab. Syst., 58 (2001) 109.
[11] V. Centner, D.-L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Elimination of uninformative variables for multivariate calibration, Anal. Chem., 68 (1996) 3851.
[12] S. Favilla, C. Durante, M.L. Vigni, M. Cocchi, Assessing feature relevance in NPLS models by VIP, Chemometr. Intell. Lab. Syst., 129 (2013) 76.
[13] O.M. Kvalheim, T.V. Karstang, Interpretation of latent-variable regression models, Chemometr. Intell. Lab. Syst., 7 (1989) 39.
[14] O.M. Kvalheim, Interpretation of partial least squares regression models by means of target projection and selectivity ratio plots, J. Chemometr., 24 (2010) 496.
[15] T. Rajalahti, R. Arneberg, F.S. Berven, K.-M. Myhr, R.J. Ulvik, O.M. Kvalheim, Biomarker discovery in mass spectral profiles by means of selectivity ratio plot, Chemometr. Intell. Lab. Syst., 95 (2009) 35.
[16] Y.-H. Yun, F. Liang, B.-C. Deng, G.-B. Lai, C. Vicente Gonçalves, H.-M. Lu, J. Yan, X. Huang, L.-Z. Yi, Y.-Z. Liang, Informative metabolites identification by variable importance analysis based on random variable combination, Metabolomics, 11 (2015) 1539.
[17] H.-D. Li, M.-M. Zeng, B.-B. Tan, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Recipe for revealing informative metabolites based on model population analysis, Metabolomics, 6 (2010) 353.
[18] W. Cai, Y. Li, X. Shao, A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra, Chemometr. Intell. Lab. Syst., 90 (2008) 188.
[19] H.-D. Li, Q.-S. Xu, Y.-Z. Liang, Random frog: an efficient reversible jump Markov chain Monte Carlo-like approach for variable selection with applications to gene selection and disease classification, Anal. Chim. Acta, 740 (2012) 20.
[20] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Recipe for uncovering predictive genes using support vector machines based on model population analysis, IEEE/ACM Trans. Comput. Biol. Bioinform., 8 (2011) 1633.
[21] V. Pihur, S. Datta, S. Datta, RankAggreg, an R package for weighted rank aggregation, BMC Bioinformatics, 10 (2009) 62.
[22] K. Kira, L.A. Rendell, The feature selection problem: traditional methods and a new algorithm, AAAI, 2 (1992) 129.
[23] I. Kononenko, in: F. Bergadano, L. De Raedt (Eds.), Machine Learning: ECML-94 (Lecture Notes in Computer Science), Springer Berlin Heidelberg, 1994, p. 171.
[24] T. Naes, I.S. Helland, Relevant components in regression, Scand. J. Stat. (1993) 239.
[25] T.N. Tran, N.L. Afanador, L.M.C. Buydens, L. Blanchet, Interpretation of variable importance in Partial Least Squares with Significance Multivariate Correlation (sMC), Chemometr. Intell. Lab. Syst., 138 (2014) 153.
[26] Y.-H. Yun, D.-S. Cao, M.-L. Tan, J. Yan, D.-B. Ren, Q.-S. Xu, L. Yu, Y.-Z. Liang, A simple idea on applying large regression coefficient to improve the genetic algorithm-PLS for variable selection in multivariate calibration, Chemometr. Intell. Lab. Syst., 130 (2014) 76.
[27] Y.-H. Yun, H.-D. Li, L.R.E. Wood, W. Fan, J.-J. Wang, D.-S. Cao, Q.-S. Xu, Y.-Z. Liang, An efficient method of wavelength interval selection based on random frog for multivariate spectral calibration, Spectrochim. Acta A, 111 (2013) 31.
[28] L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold, Multi- and Megavariate Data Analysis: Principles and Applications, Umetrics, 2001.
[29] L. Eriksson, T. Byrne, E. Johansson, J. Trygg, C. Vikström, Multi- and Megavariate Data Analysis: Basic Principles and Applications, Umetrics Academy, 2013.
[30] R. Gosselin, D. Rodrigue, C. Duchesne, A Bootstrap-VIP approach for selecting wavelength intervals in spectral imaging applications, Chemometr. Intell. Lab. Syst., 100 (2010) 12.
[31] B.-C. Deng, Y.-H. Yun, P. Ma, C.-C. Lin, D.-B. Ren, Y.-Z. Liang, A new method for wavelength interval selection that intelligently optimizes the locations, widths and combinations of the intervals, Analyst, 140 (2015) 1876.
[32] Y.-H. Yun, W.-T. Wang, B.-C. Deng, G.-B. Lai, X.-B. Liu, D.-B. Ren, Y.-Z. Liang, W. Fan, Q.-S. Xu, Using variable combination population analysis for variable selection in multivariate calibration, Anal. Chim. Acta, 862 (2015) 14.
[33] Y.-H. Yun, W.-T. Wang, M.-L. Tan, Y.-Z. Liang, H.-D. Li, D.-S. Cao, H.-M. Lu, Q.-S. Xu, A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration, Anal. Chim. Acta, 807 (2014) 36.
[34] B.-C. Deng, Y.-H. Yun, Y.-Z. Liang, L.-Z. Yi, A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling, Analyst, 139 (2014) 4836.
[35] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Model population analysis for variable selection, J. Chemometr., 24 (2010) 418.
[36] B.-C. Deng, Y.-H. Yun, Y.-Z. Liang, D.-S. Cao, Q.-S. Xu, L.-Z. Yi, X. Huang, A new strategy to prevent over-fitting in partial least squares models based on model population analysis, Anal. Chim. Acta, 880 (2015) 32.
[37] P.J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika, 82 (1995) 711.
[38] R. Berk, Statistical Learning from a Regression Perspective (Springer Series in Statistics), Springer, New York, 2008.
[39] S. Lin, J. Ding, J. Zhou, Rank aggregation of putative microRNA targets with cross-entropy Monte Carlo methods, preprint, presented at the IBC 2006 conference, Montreal, 2006.
[40] V. Pihur, S. Datta, S. Datta, Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach, Bioinformatics, 23 (2007) 1607.
[41] T. Fearn, News 3 Interview: Katherine Bakeev 4 Meetings: NIR on the Go 6 Quasi-imaging spectrometer with programmable field of view 8 Laboratory Profile: Regional Breeders Association of Lombardy 11, 2010, p. 201014.
[42] M. Zeng, Y. Liang, H. Li, M. Wang, B. Wang, X. Chen, N. Zhou, D. Cao, J. Wu, Plasma metabolic fingerprinting of childhood obesity by GC/MS in conjunction with multivariate statistical analysis, J. Pharmaceut. Biomed., 52 (2010) 265.
[43] A.M. Proenza, P. Roca, C. Crespí, I. Lladó, A. Palou, Blood amino acid compartmentation in men and women with different degrees of obesity, J. Nutr. Biochem., 9 (1998) 697.
[44] M.W. Hulver, J.R. Berggren, R.N. Cortright, R.W. Dudek, R.P. Thompson, W.J. Pories, K.G. MacDonald, G.W. Cline, G.I. Shulman, G.L. Dohm, J.A. Houmard, Skeletal muscle lipid metabolism with obesity, Am. J. Physiol. Endocrinol. Metab., 284 (2003) E741.
[45] C.L. Kien, J.Y. Bunn, F. Ugrasbul, Increasing dietary palmitic acid decreases fat oxidation and daily energy expenditure, Am. J. Clin. Nutr., 82 (2005) 320.
[46] N.G. Psihogios, R.G. Kalaitzidis, S. Dimou, K.I. Seferiadis, K.C. Siamopoulos, E.T. Bairaktari, Evaluation of tubulointerstitial lesions' severity in patients with glomerulonephritides: an NMR-based metabonomic study, J. Proteome Res., 6 (2007) 3760.
Figure Captions
Fig. 1. Variable ranking lists of different methods and the aggregated ranking list obtained by rank aggregation for the childhood overweight data.
Fig. 2. DCV results of rank aggregation with sequential addition of variables for modeling, according to the variable ranking index (the number on the X axis); e.g., 1 means the top variable is used and 3 means the top 3 variables are used for DCV. (a) Childhood obesity data: the top 3 variables give the highest prediction accuracy and AUC, 86.21% and 0.8534. (b) Tubulointerstitial lesions data: the top 3 variables give the highest prediction accuracy and AUC, 100% and 0.96.
Fig. 3. Top 30 variable ranking lists of different methods and the aggregated ranking list obtained by rank aggregation for the tubulointerstitial lesions data.
Table Captions
Table 1. 10-fold DCV results of all variables, rank aggregation and LASSO on the childhood overweight data.
Table 2. 10-fold DCV results of all variables, rank aggregation and LASSO on the tubulointerstitial lesions data.

Table 1. 10-fold DCV results of all variables, rank aggregation and LASSO on the childhood overweight data.

Method            nVar(a)  Sensitivity (%)  Specificity (%)  Accuracy (%)   AUC     Selected variables
All variables     30       81.25 (13/16)    46.15 (6/13)     65.52 (19/29)  0.6346  /
Rank aggregation  3        87.50 (14/16)    84.62 (11/13)    86.21 (25/29)  0.8534  12, 23, 5
LASSO             7        81.25 (13/16)    84.62 (11/13)    82.76 (24/29)  0.8077  12, 13, 27, 19, 5, 23, 2

(a) nVar: the number of selected variables.

Table 2. 10-fold DCV results of all variables, rank aggregation and LASSO on the tubulointerstitial lesions data.

Method            nVar  Sensitivity (%)  Specificity (%)  Accuracy (%)  AUC     Selected variables
All variables     200   92.0 (23/25)     100 (25/25)      96.0 (48/50)  0.9200  /
Rank aggregation  3     100 (25/25)      100 (25/25)      100 (50/50)   0.9600  142, 138, 182
LASSO             8     100 (25/25)      100 (25/25)      100 (50/50)   0.9600  32, 33, 182, 126, 141, 138, 142, 105