Accepted Manuscript

Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery

Yong-Huan Yun, Bai-Chuan Deng, Dong-Sheng Cao, Wei-Ting Wang, Yi-Zeng Liang

PII: S0003-2670(16)30021-6
DOI: 10.1016/j.aca.2015.12.043
Reference: ACA 234336
To appear in: Analytica Chimica Acta
Received Date: 15 September 2015
Revised Date: 28 December 2015
Accepted Date: 30 December 2015

Please cite this article as: Y.-H. Yun, B.-C. Deng, D.-S. Cao, W.-T. Wang, Y.-Z. Liang, Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery, Analytica Chimica Acta (2016), doi: 10.1016/j.aca.2015.12.043.

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Variable importance analysis based on rank aggregation with applications in metabolomics for biomarker discovery

Yong-Huan Yun1, Bai-Chuan Deng2, Dong-Sheng Cao3, Wei-Ting Wang1, Yi-Zeng Liang1,*

1 College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, P.R. China
2 College of Animal Science, South China Agricultural University, Guangzhou 510642, P.R. China
3 College of Pharmaceutical Sciences, Central South University, Changsha 410083, P.R. China

* Corresponding author. Tel.: +86-0731-88830831. E-mail address: [email protected] (Y.-Z. Liang)

Abstract

Biomarker discovery is an important goal in metabolomics. It is typically modeled as selecting the most discriminating metabolites for classification and is often referred to as variable importance analysis or variable selection. A number of variable importance analysis methods for discovering biomarkers in metabolomics studies have been proposed. However, different methods are likely to generate different variable ranking results because they rest on different principles: each method produces a variable ranking list just as an expert presents an opinion. This inconsistency between variable ranking methods is often ignored. A simple and ideal solution to this problem is to take every ranking into account. In this study, a strategy called rank aggregation was employed. It is an indispensable tool for merging individual ranking lists into a single "super"-list reflective of the overall preference or importance within the population. This "super"-list is regarded as the final ranking for biomarker discovery. Finally, it was
used for biomarker discovery and for selecting the best variable subset with the highest predictive classification accuracy. Nine methods were used, including three univariate filter methods and six multivariate methods. When applied to two metabolomics datasets (a childhood overweight dataset and a tubulointerstitial lesions dataset), rank aggregation achieved markedly higher prediction accuracy than using all variables. It also outperformed the penalized method, least absolute shrinkage and selection operator (LASSO), giving higher prediction accuracy or fewer, more interpretable selected variables.

Key words: Variable importance; Variable ranking; Biomarker discovery; Rank aggregation; Metabolomics
1. Introduction

Metabolomics is an emerging field which combines strategies to identify and quantify cellular metabolites present in organisms, cells, or tissues using advanced analytical techniques together with statistical and multivariate methods. One target of metabolomics research is to discover biomarkers, which can aid in the diagnosis of many diseases of the metabolic system and provide biological and clinical guidance. The discovery of biomarkers is typically modeled as selecting the most discriminating variables (metabolites) for classification (e.g., discriminating diseased versus healthy) [1], which is often referred to as variable importance analysis or variable selection in the language of statistics and machine learning. Any variable importance analysis method can be turned into variable selection by introducing a threshold on the variable importance values. In the past two decades, a large number of papers on biomarker discovery in metabolomics have employed statistical methods [2-4], including univariate filter methods and multivariate methods (principal component analysis (PCA), partial least squares-linear discriminant analysis (PLS-DA) [5], support vector machine (SVM) [6], random forest (RForest) [7], and penalized methods such as the least absolute shrinkage and selection operator (LASSO) [8] and the elastic net [9]). Ranking of variable (metabolite) importance is often carried out with the help of such statistical methods: the ranking assigns a measure of importance to each variable, and a subset of all metabolites can then be identified by setting a threshold value. Variable importance analysis methods consist of model-free and model-based approaches. One group of the model-free approaches is based on simple
univariate test statistics without using model information, such as the t test, fold change, and the Wilcoxon rank-sum test. These reflect whether the difference between the healthy and diseased groups' averages is statistically significant. The other group is based on features of the relationship between the variables and the classification label, such as the correlation coefficient, information gain, Euclidean distance, and mutual information. Model-based approaches, by contrast, are tied to model performance and include model-fitting and model-prediction approaches. The model-fitting approach derives importance measures from the fitted model itself, such as partial least squares (PLS) weights [10], PLS loadings [10], regression coefficients (RC) [11], variable importance in projection (VIP) [12], and the selectivity ratio (SR) [13-15]. The model-prediction approach is based on the model's prediction performance, assessed by means of resampling methods such as the jackknife, bootstrap, and cross validation. Variable importance analysis based on random variable combination (VIAVC) [16], subwindow permutation analysis (SPA) [17], uninformative variable elimination (UVE) [18], random frog (RFrog) [19], margin influence analysis (MIA) [20], RForest, and LASSO belong to the model-prediction approach. To date, researchers have often applied many different variable importance analysis methods to get as much as possible out of their data, presented only the most favorable results, and then interpreted the biomarkers that show good performance with high classification accuracy. However, many different subsets of variables from different methods may achieve the same or similar predictive accuracy. Different variable importance analysis methods are likely to generate
different variable rankings due to their different principles, even among the top-ranking variables. For instance, method "A" may identify the top three variables as potential biomarkers, whereas these variables are not ranked at the top by method "B", yet both methods discover variable subsets with the same classification accuracy. Consequently, one cannot assess which method is accurate and feasible. Moreover, inconsistency between variable rankings is often ignored in metabolomics research when presenting a new data set, because it would make the interpretation of results more confusing and raise doubts about the reliability of the data. As Franklin D. Roosevelt once said, "there are as many opinions as there are experts". Each method generates a variable ranking list just as an expert presents an opinion. The multiplicity of methods and the question of how to deal with their different ranking results are very general issues in the analysis of variable importance. To address this problem, a simple and ideal solution is to take every ranking into account.
In this study, a strategy that can combine the ranking results of all methods, called rank aggregation and first proposed by Pihur et al. [21], was introduced. It is an indispensable tool for merging individual ranking lists into a single "super"-list reflective of the overall preference or importance within the population. In other words, it aims to find a "super"-list that is as "close" as possible to all individual ranking lists simultaneously. This "super"-list is regarded as the final ranking for biomarker discovery. In this work, we employed nine variable importance analysis methods from the model-free and model-based approaches, including the t test, Wilcoxon rank-sum test,
Relief, PLS-RC, PLS-VIP, SPA, RFrog, MIA, and RForest. All nine methods were applied to the two metabolomics datasets to generate rankings of variable importance. Rank aggregation was then used to ensemble all the variable rankings and produce a "super" ranking list. Finally, this ranking list was used to select the best variable subset with the highest predictive classification accuracy.
2. Methods and theory

2.1 Variable importance analysis methods

As outlined in the introduction, variable importance analysis methods can be separated into two groups: model-free and model-based approaches. In this section, the methods we used for rank aggregation are introduced in brief.

2.1.1 Model-free approaches

Model-free approaches comprise two different strategies. One is based on univariate test statistics, such as the t test and the Wilcoxon rank-sum test. The other is based on statistical features of the relationship between the variables and the classification label, such as the correlation coefficient, information gain, Euclidean distance, and mutual information.

T test statistic

The statistical significance of a t test indicates whether or not the difference between the healthy and diseased groups' averages is real. The smaller the P-value of the t test, the more significant the difference between the two groups. Each variable of the data X receives a P-value when the difference between the healthy and diseased groups' averages is tested, and all variables can then be ranked by P-value in ascending order.

Wilcoxon rank-sum test
The Wilcoxon rank-sum test is a nonparametric test based solely on the order in which the samples from the two groups appear. As with the t test, the smaller the P-value, the more significant the difference between the two groups, so all variables can be ranked by P-value in ascending order. Since it compares rank sums, the Wilcoxon rank-sum test is more robust than the t test: it is less likely to produce spurious results in the presence of outliers.
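As an illustration, both univariate rankings take only a few lines of code. The sketch below (Python with SciPy, ours rather than the authors' MATLAB/R code) ranks the columns of a data matrix X by ascending P-value:

```python
import numpy as np
from scipy import stats

def univariate_rankings(X, y):
    """Rank the columns of X by t-test and Wilcoxon P-values (ascending)."""
    a, b = X[y == 0], X[y == 1]
    k = X.shape[1]
    p_t = np.array([stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(k)])
    p_w = np.array([stats.ranksums(a[:, j], b[:, j]).pvalue for j in range(k)])
    # argsort of ascending P-values lists variables from most to least significant
    return np.argsort(p_t), np.argsort(p_w)
```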
Relief

The Relief algorithm [22, 23] is a correlation-based method that assigns a "relevance" weight to each variable, which is meant to reflect the relevance of the variable to the target concept. The core idea of Relief is to score variables based on the difference between a selected sample and its two nearest samples of the same and of the opposite class (the "near-hit" and "near-miss"). In effect, Relief attempts to approximate the following difference of probabilities for the weight of a variable A, WA:

WA = P(different value of A | nearest sample from a different class) − P(different value of A | nearest sample from the same class)

The principle is that a good variable should distinguish between samples from different classes and should have the same value for samples from the same class. The output of Relief is a ranking of the variables by WA: the larger WA, the higher the place the corresponding variable obtains.
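A minimal Relief sketch (illustrative Python, ours; L1 nearest neighbours on min-max-scaled data and one pass over all samples, rather than the exact algorithm of [22, 23]):

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief: reward separation from the near-miss, penalize spread to the near-hit."""
    n, k = X.shape
    # scale each variable to [0, 1] so per-variable differences are comparable
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    w = np.zeros(k)
    for i in range(n):
        d = np.abs(Xs - Xs[i]).sum(axis=1)  # L1 distances to sample i
        d[i] = np.inf                       # never pick the sample itself
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(d[same])]      # nearest sample of the same class
        miss = diff[np.argmin(d[diff])]     # nearest sample of the opposite class
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / n
```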
2.1.2 Model-based approaches

Model-based approaches are related to model performance, including fitting and prediction performance. The model-fitting approach is commonly used in PLS-DA modeling: the constructed PLS-DA model is fitted and then yields a series of quantities, such as weights, loadings, RC, VIP, and SR. Here, we briefly introduce the selected model-fitting methods, RC and VIP.

PLS-DA is derived from PLS regression [6] and involves forming a regression model between X and y. Given a sample matrix X with dimensions n×k, containing n samples in rows and k variables in columns, the corresponding class label y of size n×1 holds 1 or −1 for the binary classification case, and the model is factorized with A components.
There are many versions of the PLS algorithm; for simplicity, we focus here only on the single-response case, called PLS1 or simply PLS, with orthogonal scores. Assume that A (A ≤ k) equals the number of relevant components for prediction, based on the definition of Naes and Helland [24].

1. Calculate the PLS weight vector:

w = X'y    (1)

2. Calculate the scores:

t = Xw / ∑w²    (2)

3. Calculate the X loadings:

p = t'X / ∑t²    (3)

4. Calculate the y loading:

q = y't / ∑t²    (4)

5. Deflate X and y by subtracting the contribution of t to obtain the residual data matrices:

Xresidual = X − tp    (5)

yresidual = y − tq    (6)

6. For PLS components a = 1, 2, …, A, the algorithm runs from step 1 to step 5, replacing X and y by their residuals.

Once the PLS model is built, it is possible to predict the value of y for the original data and test-set samples (known origins), or for future samples (unknown origins), as follows. The relationship between X and y can be expressed by

y = Xβ + f = Tq + f    (7)

where β is a regression coefficient vector with dimensions k×1; the y value of an unknown sample x can then be predicted by

ŷ = xβ    (8)

The estimate of β is obtained as

β = w(pw)⁻¹q    (9)

where, for several components, w and p stack the weight and loading vectors of all A components.
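Steps 1-6 and Eq. (9) translate directly into code. The following illustrative NumPy sketch is ours (the study itself used MATLAB); normalizing w is our convention and does not change the fitted model:

```python
import numpy as np

def pls1(X, y, A):
    """Orthogonal-score PLS1 following steps 1-6; X is n×k, y is an n-vector."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    n, k = X.shape
    W, P, T, q = np.zeros((k, A)), np.zeros((k, A)), np.zeros((n, A)), np.zeros(A)
    for a in range(A):
        w = X.T @ y                      # (1) weight vector
        w /= np.linalg.norm(w)           # normalize so that w'w = 1
        t = X @ w                        # (2) scores
        p = (X.T @ t) / (t @ t)          # (3) X loadings
        qa = (y @ t) / (t @ t)           # (4) y loading
        X -= np.outer(t, p)              # (5) deflate X
        y -= t * qa                      #     deflate y
        W[:, a], P[:, a], T[:, a], q[a] = w, p, t, qa
    beta = W @ np.linalg.solve(P.T @ W, q)   # (9) regression vector
    return W, P, T, q, beta
```

With A equal to the number of variables on noiseless linear data, the recovered β reproduces y exactly, and the score vectors are mutually orthogonal, two standard sanity checks for this algorithm.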
Regression coefficients (RC, β)

The vector of RC (β) gives a single measure of the relationship between each variable and the corresponding response (class label). Variables having small absolute values of this measure can be eliminated as uninformative or interfering variables [18, 25-27]. Furthermore, instead of relying on a single observed value of β, its distribution or variation can also be taken into consideration: resampling methods such as bootstrapping and Monte Carlo sampling may give more efficient estimates of the regression coefficients, at the expense of extra computation time [18]. With this measure, the larger the absolute value of a variable's regression coefficient, the more important the variable.
Variable importance in the projection (VIP)

VIP, as termed by [28], accumulates the importance of each variable j as reflected by the weight w of each component a. The VIP measure is defined as

VIPj = sqrt( k ∑a=1..A [ qa² (ta'ta) (waj / ‖wa‖)² ] / ∑a=1..A [ qa² (ta'ta) ] )    (10)

that is, the variance explained by each component, qa² ta'ta, is weighted by the contribution (waj / ‖wa‖)² of variable j to that component and summed over all A components. Furthermore, variable j can be eliminated if VIPj < u for a user-defined threshold u ∈ [0, ∞). It is generally accepted that a variable should be selected if its VIP score is greater than 1, which is typically used as the criterion for important variable selection [5, 29, 30]. In this study, the variables were sorted in descending order of their VIP scores.
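Given the W, T, and q produced by a PLS fit, Eq. (10) is nearly a one-liner; a Python sketch (our illustration, with k variables and A components):

```python
import numpy as np

def vip_scores(W, T, q):
    """VIP of Eq. (10): W is the k×A weight matrix, T the n×A scores, q the A y-loadings."""
    k, A = W.shape
    ssy = (q ** 2) * np.einsum('na,na->a', T, T)   # q_a^2 * t_a' t_a per component
    wnorm2 = (W / np.linalg.norm(W, axis=0)) ** 2  # (w_aj / ||w_a||)^2
    return np.sqrt(k * (wnorm2 @ ssy) / ssy.sum())
```

A useful property for checking an implementation: the squared VIP scores always sum to k, since each column of normalized squared weights sums to one.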
As for the model-prediction approach, it is based on the model's prediction performance, assessed by means of resampling methods.
Subwindow permutation analysis (SPA)

SPA [17], coupled with PLS-DA, investigates the importance of each variable while implicitly taking the synergistic effect of the remaining variables into account. It assesses the importance of every variable by comparing the distributions of normal prediction errors (NPEs) and permuted prediction errors (PPEs) of the same variable, using the strategy of model population analysis (MPA) [16, 31-36]. MPA aims at extracting statistical information from a large 'population' of sub-models built on different sub-datasets generated from the original dataset by Monte Carlo (MC) sampling. The use of variable subsets makes SPA more efficient and faster on high-dimensional datasets. For each variable, SPA finally outputs a conditional P-value computed from the two distributions of NPEs and PPEs. The smaller the P-value, the more important the corresponding variable.
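The NPE/PPE comparison at the core of SPA can be sketched as follows. This is a deliberate simplification, not the algorithm of [17]: a nearest-centroid classifier stands in for the PLS-DA sub-models, the rank-sum test stands in for the error-distribution comparison, and the function names are ours:

```python
import numpy as np
from scipy.stats import ranksums

def spa_pvalues(X, y, n_models=100, frac=0.7, seed=0):
    """MPA-style sketch: compare normal vs. permuted prediction errors per variable."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    npe = [[] for _ in range(k)]
    ppe = [[] for _ in range(k)]
    for _ in range(n_models):
        train = rng.choice(n, int(frac * n), replace=False)
        test = np.setdiff1d(np.arange(n), train)
        mu0 = X[train][y[train] == 0].mean(axis=0)  # class centroids stand in
        mu1 = X[train][y[train] == 1].mean(axis=0)  # for the PLS-DA sub-model
        def error(Xt):
            pred = np.abs(Xt - mu1).sum(axis=1) < np.abs(Xt - mu0).sum(axis=1)
            return np.mean(pred.astype(int) != y[test])
        e_normal = error(X[test])
        for j in range(k):
            Xp = X[test].copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # destroy variable j's information
            npe[j].append(e_normal)
            ppe[j].append(error(Xp))
    # one-sided test: permuting an important variable should inflate the error
    return np.array([ranksums(ppe[j], npe[j], alternative='greater').pvalue
                     for j in range(k)])
```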
Random frog (RFrog)

RFrog applies a reversible jump Markov chain Monte Carlo (RJMCMC)-like algorithm [37] to assess the importance of each variable. It searches the model space through both fixed- and trans-dimensional moves between different models generated by MC sampling; a pseudo-MCMC chain is then computed and used to derive a selection probability, which can be regarded as a criterion for the relevance of each variable. Variables can then be ranked in terms of their selection frequency over the population of obtained variable subsets: the most frequently selected variable is the most important one.
Margin Influence Analysis (MIA)

MIA [20] is based on MPA and designed to work with support vector machines (SVM) for variable selection. It reveals statistically significant variables based on the fact that the margin of an SVM plays an important role in the generalization performance of SVM models. First, MC sampling is employed to generate a population of sub-datasets (different variable combinations) in the variable space; a sub-model is then built on each sub-dataset using SVM. Finally, the importance of each variable is assessed by comparing two groups of margin distributions over all constructed SVM classifiers (those with and those without the specific variable in a large variable combination). The nonparametric Mann-Whitney U test is applied to the margin distributions to check whether the increase in margin is significant, yielding a P-value for each variable. The smallest P-values indicate the most informative variables.
Random Forest (RForest)

RForest, developed by Breiman [7], is a classification algorithm that ensembles unpruned decision trees, each constructed via a tree classification algorithm such as classification and regression trees (CART) [38]. Each tree is grown on an independently drawn bootstrap sample and a randomly selected small group of variables from the training data. Once formed, the forest assigns the class that receives the majority vote. Variable importance in RForest is measured by the increase in prediction error when the values of that variable are permuted across the out-of-bag (OOB) observations. For each variable, the classification accuracies of the OOB samples with and without permutation of its values are computed; the differences are then averaged over the entire ensemble and divided by their standard deviation over the ensemble to produce the importance of each variable.
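The permutation-importance idea can be approximated with scikit-learn. The sketch below is illustrative only: it permutes variables on a held-out half instead of the per-tree OOB samples, and it omits the standard-deviation scaling described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_permutation_importance(X, y, seed=0):
    """Accuracy drop on held-out data when each variable is permuted."""
    rng = np.random.default_rng(seed)
    half = len(y) // 2
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X[:half], y[:half])                # train on one half
    base = rf.score(X[half:], y[half:])       # accuracy on the other half
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X[half:].copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # destroy variable j's information
        importance[j] = base - rf.score(Xp, y[half:])
    return importance
```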
2.2 Rank aggregation

Rank aggregation [21] techniques aim to provide a general framework for performing this essential aggregation objectively. With the rapid growth of metabolomics studies, the
potential application of rank aggregation becomes even more necessary, to solve the problem of inconsistency between different variable ranking methods. When assessing variable importance, the methods mentioned above give different variable ranking results. One of the major strengths of rank-based aggregation is its ability to merge all ranking lists generated by the different algorithms.

Given the different variable ranking lists, an objective function must first be defined so that rank aggregation can be performed in the framework of an optimization problem. In this sense, we intend to find a "super"-list that is as "close" as possible to all individual variable ranking lists simultaneously. Thus, the rank aggregation problem can be turned into a minimization problem over all possible orderings of the variables, as follows:
φ(δ) = ∑i=1..m wi d(δ, Li)    (18)

where δ is a proposed ranking list of length k = |Li|, wi is the importance weight associated with list Li, d is a distance function, and Li is the i-th ranking list. The idea is to find the δ* that minimizes the total distance between δ* and all the Li:

δ* = arg minδ ∑i=1..m wi d(δ, Li)    (19)

In this work, one of the most popular distance functions, the Spearman footrule distance, is used to measure the distance between ordered lists [21]. The Spearman footrule distance between Li and any ranking list δ is defined as

S(δ, Li) = ∑t∈Li | rδ(t) − rLi(t) |    (20)

where rLi(t) is the rank of element t in the list Li (1 means the "best") if t is within the top k, and equals k + 1 otherwise; rδ(t) is defined in the same way. The distance is nothing more than the sum of the absolute differences between the ranks of all individual elements in the two ranking lists, a very intuitive metric for comparing two lists: the smaller its value, the more similar the ranking lists.
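Equations (18) and (20) translate directly into code. A small illustrative Python sketch (ours) of the footrule distance and the weighted objective φ(δ); for simplicity it sums over the elements of each top-k list, as in Eq. (20):

```python
def footrule(delta, L, k):
    """Spearman footrule S(delta, L) of Eq. (20); ranks beyond the top k count as k + 1."""
    rank_delta = {v: r + 1 for r, v in enumerate(delta[:k])}
    return sum(abs(rank_delta.get(t, k + 1) - (r + 1)) for r, t in enumerate(L[:k]))

def phi(delta, lists, weights=None):
    """Weighted objective of Eq. (18); the aggregated "super"-list minimizes this."""
    k = len(delta)
    weights = weights if weights is not None else [1.0] * len(lists)
    return sum(w * footrule(delta, L, k) for w, L in zip(weights, lists))
```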
The cross-entropy (CE) Monte Carlo algorithm [39] is used as the minimization algorithm to find the optimal variable ranking list. Details of the CE algorithm can be found in [40]. Briefly, CE is a stochastic search over 0-1 matrices whose columns sum to one and whose rows sum to at most one, since any ranking list can be uniquely mapped to such a matrix.
2.3 The selection of the optimal variable subset

This step discovers the potential biomarkers. The optimal variable subset is selected according to the aggregated ranking list of all the selected variable importance analysis methods. Variables from the "super"-list are sequentially added to build PLS-DA models whose performance is assessed using 10-fold double cross validation (DCV) [41]. The area under the receiver operating characteristic curve (AUC) from the 10-fold DCV of PLS-DA is used to assess each variable subset, and the variable whose addition yields the maximum AUC value defines the threshold. For example, if the fifth variable on the ranking list gives the maximum AUC value, the top five variables form the optimal variable subset, with the best 10-fold DCV AUC value as its performance.
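The sequential-addition search can be sketched as follows (illustrative Python; the `score` callback is a placeholder for the 10-fold DCV AUC of a PLS-DA model, which is not implemented here):

```python
def best_subset(super_list, score):
    """Add variables in aggregated-rank order; keep the prefix with the best score."""
    best_m, best_val = 1, float('-inf')
    for m in range(1, len(super_list) + 1):
        val = score(super_list[:m])  # e.g. 10-fold DCV AUC of a PLS-DA model
        if val > best_val:
            best_m, best_val = m, val
    return super_list[:best_m], best_val
```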
3. Software

All calculations were performed in MATLAB (version 2013a, The MathWorks, Inc.) and R (version 3.1.2, i386) on a general-purpose computer with an Intel Core i5 3.2 GHz CPU and 3 GB of RAM running Microsoft Windows XP. Our in-house MATLAB code (http://www.libpls.net) was used, together with the R package RankAggreg.
4. Results and discussion

The maximum number of latent variables was set to 5 in the cross validation of PLS-DA. All data were first autoscaled to zero mean and unit variance before modeling, except for the random forest model.
4.1 Childhood overweight data

This dataset was obtained by GC-MS analysis of blood plasma samples collected from 16 children of normal weight and 13 overweight children at the Xiangya Hospital of Central South University in Changsha, China. The clinical experiments were approved by the Xiangya Institutional Human Subjects Committee. All samples were analyzed on a Shimadzu GC-MS-QP2010 gas chromatography-mass spectrometry instrument (Shimadzu, Kyoto, Japan), and thirty metabolites were identified and accurately quantified using an internal standard. More details about this dataset are given in [42].
The nine variable importance methods were applied to this dataset; the results are shown in Fig. 1. As can be seen from Fig. 1, different methods generate different results, which makes the selection confusing. Thus, rank aggregation is needed to integrate the different results and provide a consensual variable ranking list. This ranking list is reliable and is used to select the best variable subset with the highest predictive classification accuracy.

(Insert Figure 1)
To reveal the best variable subset, 10-fold DCV of PLS-DA was conducted by adding the top variables sequentially according to the aggregated rank. The results of the 10-fold DCV of PLS-DA for each subset are displayed in Table 1 and Fig. 2a. The top 3 metabolites (variables NO. 12, 23, and 5) form the best subset, with the maximum AUC value of 0.8534 and the highest prediction accuracy of 0.8621. Rank aggregation aims to merge individual ranking lists into a single "super"-list reflective of the overall preference or importance within the population; accordingly, these 3 metabolites were consistently given high ranking scores. Variable NO. 12 appears seven times in first place, once in third place, and once in fourth place, and thus takes first place in the aggregated ranking list. Variable NO. 23 appears five times in second place, once in third place, once in fourth place, and twice in seventh place. Although variable NO. 23 ranks higher than NO. 12 in the Wilcoxon rank-sum test, it cannot beat NO. 12, which obtains first place from seven methods. Thus, assessing the importance of each variable should draw on the opinion of every method; it is unreliable to assess importance relying on the result of only one method. As another example, variable NO. 20 acquires first place from the Wilcoxon rank-sum test but receives fluctuating ranks from the other methods (4th, 8th, 9th, 10th, 12th, 16th, and 23rd places), leading to 9th place in the aggregated ranking list. Therefore, it is necessary to perform rank aggregation to integrate the different ranking results and produce a consensual and reliable result.
To allow a comparison with biomarker discovery by rank aggregation, the penalized method LASSO was used. LASSO is a shrinkage and variable selection method that minimizes the residual sum of squared errors subject to the sum of the absolute values of the coefficients being less than a constant. This constraint enables LASSO to perform variable selection, because the shrinkage can drive some coefficients exactly to zero. As a result, it reduces the estimation variance while retaining the informative variables, yielding an interpretable final model. The variables corresponding to zero coefficients are uninformative. The variables selected by LASSO, those with non-zero coefficients, are shown in Table 1. The variables selected by rank aggregation are included among those selected by LASSO. However, rank aggregation performs better than LASSO, with higher prediction accuracy and fewer variables, which indicates that rank aggregation is more reliable and that its selection of variables is consensual and more interpretable.

(Insert Table 1)

(Insert Figure 2)
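For reference, LASSO-style selection of the non-zero-coefficient variables can be sketched with scikit-learn (an illustration only; `alpha` below is an arbitrary regularization strength, not the value tuned in this study):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_selected(X, y, alpha=0.1):
    """Fit LASSO and return the indices of variables with non-zero coefficients."""
    model = Lasso(alpha=alpha).fit(X, y)
    return np.flatnonzero(model.coef_)
```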
As the top 3 variables are finally selected as the best subset, the number of overlaps between the top 3 variables of the aggregated list and those of each method can be used as a simple assessment of each method's performance. The top 3 variables of SPA, RFrog, and MIA are exactly the same as those of the aggregated list. VIP performs the worst, with no overlapping variable among its top 3. For this dataset, model-prediction approaches are better than model-fitting and model-free approaches.
The selected metabolites, variables NO. 12, 23, and 5, are β-hydroxybutyric acid, glyceric acid, and palmitic acid (PA), respectively. As reported in the study of Proenza et al. [43], the level of β-hydroxybutyrate in overweight men is higher than in lean men, which indicates a close relationship between β-hydroxybutyrate and overweight diagnosis and confirms that this metabolite is correctly identified as an important variable. Glyceric acid can be derived from the oxidation of glycerol, and its phosphate derivatives, including 2-phosphoglyceric acid, 3-phosphoglyceric acid, 2,3-bisphosphoglyceric acid, and 1,3-bisphosphoglyceric acid, are very important biochemical intermediates of lipid metabolism involved in the development of overweight [44]. In addition, according to the work of Kien et al. [45], increased dietary PA can reduce fat oxidation and may increase the risk of obesity and insulin resistance, suggesting that PA plays an important role in obesity risk. Thus, PA is directly related to overweight diagnosis.
4.2 Tubulointerstitial lesions data (TLS)

The TLS data were obtained by GC-MS analysis of urinary samples collected from 25 patients with mild tubulointerstitial lesions and 25 patients with severe tubulointerstitial lesions at the Department of Nephrology of the University Hospital of Ioannina, each sample comprising 200 bins. All study participants gave informed consent for the investigation, which was approved by the Ethical Committee of the University Hospital of Ioannina. A more detailed description is given in reference [46].
The TLS data have 200 variables, most of which are uninformative; rank aggregation was therefore used to generate only the top 30 aggregated variables in this work. The variable rankings produced by the nine variable importance analysis methods and their aggregation are shown in Fig. 3. It is visually clear that different methods generate different results. When rank aggregation was performed to integrate them, it provided a consensus variable ranking list that takes the opinion of every one of the nine methods into account; the merged list is based on the rank score of each method. In Fig. 3, variable No. 142 takes first place in the aggregated list because most of the methods give it a high rank score. Variable No. 182 is ranked first by two methods (Relief and RFrog) and variable No. 138 is likewise ranked first by two methods (t-test and VIP), but No. 182 obtains more fluctuating results than No. 138; No. 138 therefore ranks higher than No. 182. This demonstrates again that the importance of each variable should be assessed from the results of all methods rather than by relying on any single one.
The aggregated ranking list was then used to select the variable subset with the highest predictive classification accuracy for biomarker discovery. 10-fold DCV of PLS-DA was conducted by adding the top-ranked variables sequentially according to the aggregated list; the results for each subset are displayed in Fig. 2b and Table 2. The top 3 metabolites (variables No. 142, 138 and 182) were selected as the best subset, with the maximum AUC of 0.96 and the highest prediction accuracy of 1. Compared with LASSO, rank aggregation selects fewer variables while achieving the same prediction performance (Table 2), which shows that different variable subsets from different methods can achieve the same predictive performance; the variables selected by rank aggregation are also among those selected by LASSO. In the original study, these three variables belong to two biomarkers, citrate and proteins, which were found at relatively higher levels in patients. A previous study [42] reported that the initiation of tubulointerstitial lesions is characterized by decreased excretion of citrate and that proteinuria often characterizes renal patients. The two metabolites are therefore potential biomarkers that can distinguish patients with mild tubulointerstitial lesions from those with severe lesions, which indicates that the variables selected by rank aggregation are more interpretable than those selected by LASSO.
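The idea of merging ranked lists can be sketched with a simple mean-rank (Borda-count) scheme. The paper itself uses the weighted cross-entropy Monte Carlo optimization of the RankAggreg R package [21], so this is only an illustrative stand-in; the toy variable numbers loosely echo those discussed above.

```python
# A minimal mean-rank (Borda-count) sketch of rank aggregation. It only
# illustrates the idea that variables ranked highly by most methods rise
# toward the top of the merged list; it is not the cross-entropy Monte
# Carlo optimization used in the study.
from collections import defaultdict

def aggregate_rankings(ranking_lists):
    """Merge several best-first ranked lists into one consensus list."""
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for ranking in ranking_lists:
        for position, var in enumerate(ranking):
            rank_sums[var] += position     # position 0 = top of the list
            counts[var] += 1
    mean_rank = {v: rank_sums[v] / counts[v] for v in rank_sums}
    return sorted(mean_rank, key=mean_rank.get)   # lower mean rank first

# Toy lists from three "methods" ranking four variables:
lists = [[142, 138, 182, 7],
         [182, 142, 138, 7],
         [138, 142, 7, 182]]
print(aggregate_rankings(lists))  # → [142, 138, 182, 7]
```

The cross-entropy Monte Carlo approach instead searches for the list minimizing a distance (e.g. a weighted Spearman footrule) to all individual lists, which makes the aggregation more robust to a single outlying method.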
(Insert Table 2)
Counting the overlap between the top 3 variables of the aggregated list and those of each method, only the model-free approaches, the t-test and Relief, select exactly the same variables as the aggregated list; the other methods share two variables with it. For this data set, the model-free approaches are therefore better than the model-fitting and model-prediction approaches.
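The overlap count is simply the size of the intersection of top-3 sets. In the snippet below, only the t-test and Relief sets are given in the text; the "VIP" set is hypothetical, chosen to show a partial overlap.

```python
# Illustrative overlap count between the aggregated top-3 list and each
# method's own top 3. The "VIP" entry is hypothetical.
aggregated_top3 = {142, 138, 182}
method_top3 = {
    "t test": {142, 138, 182},   # stated in the text to match exactly
    "Relief": {142, 138, 182},   # stated in the text to match exactly
    "VIP":    {138, 142, 33},    # hypothetical: two variables in common
}
overlap = {m: len(s & aggregated_top3) for m, s in method_top3.items()}
print(overlap)  # → {'t test': 3, 'Relief': 3, 'VIP': 2}
```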
5. Conclusion
In this study, rank aggregation was employed to merge the ranking results of nine variable importance analysis methods into a single "super"-list reflecting the overall preference, or importance, within the population, and biomarker discovery was then conducted on this list. The results show that good performance can be obtained using the metabolites revealed by rank aggregation compared with using all variables or the LASSO method. Since each method generates a variable ranking list much as an expert offers an opinion, rank aggregation is needed to take all of the opinions into account and produce a reliable variable ranking list. Comparing the overlap between the top variables of the model-free, model-fitting and model-prediction approaches indicates that no single method performs best all the time: all of the methods are data dependent. Rank aggregation is thus a good way to produce a reliable variable ranking for biomarker discovery, and it can address, to some extent, the inconsistency between different variable ranking methods. It should be noted that when employing rank aggregation, more diverse variable importance analysis methods should be introduced to produce a more consensual and reliable ranking.
Acknowledgements
This work is financially supported by the National Natural Science Foundation of China (grants no. 21275164, 21465016, 21175157 and 21375151) and by the Fundamental Research Funds for the Central Universities of Central South University (grant no. 2014zzts014). The studies meet with the approval of the university's review board.
References
[1] M. Hilario, A. Kalousis, Approaches to dimensionality reduction in proteomic biomarker studies, Brief. Bioinform., 9 (2008) 102.
[2] M. Dash, H. Liu, Feature selection for classification, Intell. Data Anal., 1 (1997) 131.
[3] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res., 3 (2003) 1157.
[4] Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics, Bioinformatics, 23 (2007) 2507.
[5] T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in Partial Least Squares Regression, Chemometr. Intell. Lab. Syst., 118 (2012) 62.
[6] N. Zavaljevski, F.J. Stevens, J. Reifman, Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions, Bioinformatics, 18 (2002) 689.
[7] L. Breiman, Random forests, Mach. Learn., 45 (2001) 5.
[8] R. Tibshirani, Regression shrinkage and selection via the Lasso, J. Roy. Stat. Soc. B, 58 (1996) 267.
[9] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Stat. Soc. B, 67 (2005) 301.
[10] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemometr. Intell. Lab. Syst., 58 (2001) 109.
[11] V. Centner, D.-L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, C. Sterna, Elimination of uninformative variables for multivariate calibration, Anal. Chem., 68 (1996) 3851.
[12] S. Favilla, C. Durante, M.L. Vigni, M. Cocchi, Assessing feature relevance in NPLS models by VIP, Chemometr. Intell. Lab. Syst., 129 (2013) 76.
[13] O.M. Kvalheim, T.V. Karstang, Interpretation of latent-variable regression models, Chemometr. Intell. Lab. Syst., 7 (1989) 39.
[14] O.M. Kvalheim, Interpretation of partial least squares regression models by means of target projection and selectivity ratio plots, J. Chemometr., 24 (2010) 496.
[15] T. Rajalahti, R. Arneberg, F.S. Berven, K.-M. Myhr, R.J. Ulvik, O.M. Kvalheim, Biomarker discovery in mass spectral profiles by means of selectivity ratio plot, Chemometr. Intell. Lab. Syst., 95 (2009) 35.
[16] Y.-H. Yun, F. Liang, B.-C. Deng, G.-B. Lai, C. Vicente Gonçalves, H.-M. Lu, J. Yan, X. Huang, L.-Z. Yi, Y.-Z. Liang, Informative metabolites identification by variable importance analysis based on random variable combination, Metabolomics, 11 (2015) 1539.
[17] H.-D. Li, M.-M. Zeng, B.-B. Tan, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Recipe for revealing informative metabolites based on model population analysis, Metabolomics, 6 (2010) 353.
[18] W. Cai, Y. Li, X. Shao, A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra, Chemometr. Intell. Lab. Syst., 90 (2008) 188.
[19] H.-D. Li, Q.-S. Xu, Y.-Z. Liang, Random frog: an efficient reversible jump Markov chain Monte Carlo-like approach for variable selection with applications to gene selection and disease classification, Anal. Chim. Acta, 740 (2012) 20.
[20] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Recipe for uncovering predictive genes using support vector machines based on model population analysis, IEEE/ACM Trans. Comput. Biol. Bioinform., 8 (2011) 1633.
[21] V. Pihur, S. Datta, S. Datta, RankAggreg, an R package for weighted rank aggregation, BMC Bioinformatics, 10 (2009) 62.
[22] K. Kira, L.A. Rendell, The feature selection problem: traditional methods and a new algorithm, AAAI, 2 (1992) 129.
[23] I. Kononenko, in: F. Bergadano, L. De Raedt (Eds.), Machine Learning: ECML-94 (Lecture Notes in Computer Science), Springer Berlin Heidelberg, 1994, p. 171.
[24] T. Naes, I.S. Helland, Relevant components in regression, Scand. J. Stat. (1993) 239.
[25] T.N. Tran, N.L. Afanador, L.M.C. Buydens, L. Blanchet, Interpretation of variable importance in Partial Least Squares with Significance Multivariate Correlation (sMC), Chemometr. Intell. Lab. Syst., 138 (2014) 153.
[26] Y.-H. Yun, D.-S. Cao, M.-L. Tan, J. Yan, D.-B. Ren, Q.-S. Xu, L. Yu, Y.-Z. Liang, A simple idea on applying large regression coefficient to improve the genetic algorithm-PLS for variable selection in multivariate calibration, Chemometr. Intell. Lab. Syst., 130 (2014) 76.
[27] Y.-H. Yun, H.-D. Li, L.R.E. Wood, W. Fan, J.-J. Wang, D.-S. Cao, Q.-S. Xu, Y.-Z. Liang, An efficient method of wavelength interval selection based on random frog for multivariate spectral calibration, Spectrochim. Acta A, 111 (2013) 31.
[28] L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold, Multi- and Megavariate Data Analysis: Principles and Applications, Umetrics, 2001.
[29] L. Eriksson, T. Byrne, E. Johansson, J. Trygg, C. Vikström, Multi- and Megavariate Data Analysis: Basic Principles and Applications, Umetrics Academy, 2013.
[30] R. Gosselin, D. Rodrigue, C. Duchesne, A Bootstrap-VIP approach for selecting wavelength intervals in spectral imaging applications, Chemometr. Intell. Lab. Syst., 100 (2010) 12.
[31] B.-C. Deng, Y.-H. Yun, P. Ma, C.-C. Lin, D.-B. Ren, Y.-Z. Liang, A new method for wavelength interval selection that intelligently optimizes the locations, widths and combinations of the intervals, Analyst, 140 (2015) 1876.
[32] Y.-H. Yun, W.-T. Wang, B.-C. Deng, G.-B. Lai, X.-B. Liu, D.-B. Ren, Y.-Z. Liang, W. Fan, Q.-S. Xu, Using variable combination population analysis for variable selection in multivariate calibration, Anal. Chim. Acta, 862 (2015) 14.
[33] Y.-H. Yun, W.-T. Wang, M.-L. Tan, Y.-Z. Liang, H.-D. Li, D.-S. Cao, H.-M. Lu, Q.-S. Xu, A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration, Anal. Chim. Acta, 807 (2014) 36.
[34] B.-C. Deng, Y.-H. Yun, Y.-Z. Liang, L.-Z. Yi, A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling, Analyst, 139 (2014) 4836.
[35] H.-D. Li, Y.-Z. Liang, Q.-S. Xu, D.-S. Cao, Model population analysis for variable selection, J. Chemometr., 24 (2010) 418.
[36] B.-C. Deng, Y.-H. Yun, Y.-Z. Liang, D.-S. Cao, Q.-S. Xu, L.-Z. Yi, X. Huang, A new strategy to prevent over-fitting in partial least squares models based on model population analysis, Anal. Chim. Acta, 880 (2015) 32.
[37] P.J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika, 82 (1995) 711.
[38] R. Berk, Statistical Learning from a Regression Perspective (Springer Series in Statistics), Springer, New York, 2008.
[39] S. Lin, J. Ding, J. Zhou, Rank aggregation of putative microRNA targets with cross-entropy Monte Carlo methods, preprint, presented at the IBC 2006 conference, Montreal, 2006.
[40] V. Pihur, S. Datta, S. Datta, Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach, Bioinformatics, 23 (2007) 1607.
[41] T. Fearn, News 3 Interview: Katherine Bakeev 4 Meetings: NIR on the Go 6 Quasi-imaging spectrometer with programmable field of view 8 Laboratory Profile: Regional Breeders Association of Lombardy 11, 2010, p. 201014.
[42] M. Zeng, Y. Liang, H. Li, M. Wang, B. Wang, X. Chen, N. Zhou, D. Cao, J. Wu, Plasma metabolic fingerprinting of childhood obesity by GC/MS in conjunction with multivariate statistical analysis, J. Pharmaceut. Biomed., 52 (2010) 265.
[43] A.M. Proenza, P. Roca, C. Crespí, I. Lladó, A. Palou, Blood amino acid compartmentation in men and women with different degrees of obesity, J. Nutr. Biochem., 9 (1998) 697.
[44] M.W. Hulver, J.R. Berggren, R.N. Cortright, R.W. Dudek, R.P. Thompson, W.J. Pories, K.G. MacDonald, G.W. Cline, G.I. Shulman, G.L. Dohm, J.A. Houmard, Skeletal muscle lipid metabolism with obesity, Am. J. Physiol. Endocrinol. Metab., 284 (2003) E741.
[45] C.L. Kien, J.Y. Bunn, F. Ugrasbul, Increasing dietary palmitic acid decreases fat oxidation and daily energy expenditure, Am. J. Clin. Nutr., 82 (2005) 320.
[46] N.G. Psihogios, R.G. Kalaitzidis, S. Dimou, K.I. Seferiadis, K.C. Siamopoulos, E.T. Bairaktari, Evaluation of tubulointerstitial lesions' severity in patients with glomerulonephritides: an NMR-based metabonomic study, J. Proteome Res., 6 (2007) 3760.
Figure Captions
Fig. 1. Variable ranking lists of different methods and the aggregated ranking list obtained by rank aggregation for the childhood overweight data.
Fig. 2. DCV results of rank aggregation with sequential addition of variables for modeling, according to the variable ranking index (the number on the X axis); e.g., 1 means the top variable is used and 3 means the top 3 variables are used for DCV. (a) Childhood obesity data: the top 3 variables give the highest prediction accuracy and AUC, 86.21% and 0.8534. (b) Tubulointerstitial lesions data: the top 3 variables give the highest prediction accuracy and AUC, 100% and 0.96.
Fig. 3. Top 30 variable ranking lists of different methods and the aggregated ranking list obtained by rank aggregation for the tubulointerstitial lesions data.
Table Captions
Table 1. 10-fold DCV results of all variables, rank aggregation and LASSO on the childhood overweight data.
Table 2. 10-fold DCV results of all variables, rank aggregation and LASSO on the tubulointerstitial lesions data.

Table 1. 10-fold DCV results of all variables, rank aggregation and LASSO on the childhood overweight data.

Method            nVar(a)  Sensitivity (%)  Specificity (%)  Accuracy (%)   AUC     Selected variables
All variables     30       81.25 (13/16)    46.15 (6/13)     65.52 (19/29)  0.6346  /
Rank aggregation  3        87.50 (14/16)    84.62 (11/13)    86.21 (25/29)  0.8534  12, 23, 5
LASSO             7        81.25 (13/16)    84.62 (11/13)    82.76 (24/29)  0.8077  12, 13, 27, 19, 5, 23, 2

(a) nVar: the number of selected variables.

Table 2. 10-fold DCV results of all variables, rank aggregation and LASSO on the tubulointerstitial lesions data.

Method            nVar  Sensitivity (%)  Specificity (%)  Accuracy (%)  AUC     Selected variables
All variables     200   92.0 (23/25)     100 (25/25)      96.0 (48/50)  0.9200  /
Rank aggregation  3     100 (25/25)      100 (25/25)      100 (50/50)   0.9600  142, 138, 182
LASSO             8     100 (25/25)      100 (25/25)      100 (50/50)   0.9600  32, 33, 182, 126, 141, 138, 142, 105