IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 60, NO. 3, JUNE 2013


Generalization Evaluation of Machine Learning Numerical Observers for Image Quality Assessment

Mahdi M. Kalayeh, Thibault Marin, Member, IEEE, and Jovan G. Brankov, Senior Member, IEEE

Abstract—In this paper, we present two new numerical observers (NO) based on machine learning for image quality assessment. The proposed NOs aim to predict human observer performance in a cardiac perfusion-defect detection task for single-photon emission computed tomography (SPECT) images. Human observer (HumO) studies are now considered to be the gold standard for task-based evaluation of medical images. However such studies are impractical for use in early stages of development for imaging devices and algorithms, because they require extensive involvement of trained human observers who must evaluate a large number of images. To address this problem, numerical observers (also called model observers) have been developed as a surrogate for human observers. The channelized Hotelling observer (CHO), with or without internal noise model, is currently the most widely used NO of this kind. In our previous work we argued that development of a NO model to predict human observers’ performance can be viewed as a machine learning (or system identification) problem. This consideration led us to develop a channelized support vector machine (CSVM) observer, a kernel-based regression model that greatly outperformed the popular and widely used CHO. This was especially evident when the numerical observers were evaluated in terms of generalization performance. To evaluate generalization we used a typical situation for the practical use of a numerical observer: after optimizing the NO (which for a CHO might consist of adjusting the internal noise model) based upon a broad set of reconstructed images, we tested it on a broad (but different) set of images obtained by a different reconstruction method. In this manuscript we aim to evaluate two new regression models that achieve accuracy higher than the CHO and comparable to our earlier CSVM method, while dramatically reducing model complexity and computation time. 
The new models are defined in a Bayesian machine-learning framework: a channelized relevance vector machine (CRVM) and a multi-kernel CRVM (MKCRVM).

Index Terms—Channelized Hotelling observer (CHO), human observer (HumO), image quality, numerical observer, relevance vector machine (RVM), single-photon emission computed tomography (SPECT).

I. INTRODUCTION

IMAGE quality is traditionally evaluated by using numerical criteria such as signal-to-noise ratio (SNR), bias-variance trade-off, etc. [1]. However, such metrics have significant limitations, as recognized in the 1970s in the context of radiographic imaging by Lusted [2], who pointed out that an image

Manuscript received October 23, 2012; revised January 30, 2013; accepted March 26, 2013. Date of publication May 16, 2013; date of current version June 12, 2013. This work was supported under grants NIH/NHLBI HL065425 and HL091017. The authors are with the Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616 USA (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TNS.2013.2257183

Fig. 1. Block diagram of a HumO and NO in a lesion-detection task. A HumO reports a response describing his confidence that a lesion is present. In NO design, the goal is to find the mapping between image features and the human observer response.

can reproduce the shape and texture of structures faithfully from a physical and quantitative standpoint, while failing to contain useful diagnostic information, especially if the principal agent for diagnosis is a human observer. Lusted postulated that, to measure the worth of a diagnostic imaging test, one must assess the observer’s performance when using the imaging test. Lusted argued that receiver-operating characteristic (ROC) [1] curves provide an ideal means to assess the observer’s performance in deciding the patient’s condition. ROC curves and their analysis are based on statistical decision theory, originally developed for electronic signal detection [3]. Swets and Pickett [4] stressed the medical importance of two properties captured by ROC curves: 1) they display all possible trade-offs of specificity versus sensitivity; and 2) they allow calculation of the cost and benefits of one’s decisions in disease treatment. Based on these ideas, the ROC methodology [5] and the use of human observers for task-based evaluation are now widely accepted as the gold standard for assessment of medical image quality [1], [6]. Unfortunately, human observer (HumO) studies are costly, time consuming and hard to organize and therefore impractical for optimizing and testing imaging devices and algorithms. For this reason numerical observers (NO), mathematical surrogates for human observers, have received growing attention as a practical alternative for predicting human diagnostic performance and facilitating automated image quality assessment. The most popular NO is the channelized Hotelling observer (CHO) [7], a statistical based method that extracts features from images via non-overlapping, symmetrical, band-pass channels and applies a linear discriminant to these features (see Figs. 1 and 2). 
In short, the rationale behind the CHO rests on two ideas: the band-pass nature of the human visual system, and the observation that, in low-contrast scenarios, human visual detection is roughly a linear discriminant process [1]. The CHO is thus based on an easily understandable idea and is also very simple to implement.

0018-9499/$31.00 © 2013 IEEE


We explored this idea previously in a short conference paper [22], where we showed that the channelized RVM's (CRVM) performance was better than that of the CSVM in terms of prediction accuracy, while requiring a lower computational complexity for model selection. In this work, we expand the initial CRVM evaluation to a multiple-kernel RVM model [21].

Fig. 2. Feature extraction channels in the frequency domain.

These properties, accompanied by relatively good performance in predicting human diagnostic performance in some settings [8], contributed to the popularity of the CHO. However, its performance in detection tasks is not always well correlated with human observers [9], [10]. Therefore, the addition of an internal-noise model is often necessary to improve the correlation of the CHO's performance with that of a human observer [10]. The internal noise model is selected and tuned to match the performance of a given observer. Different types of internal noise models have also been proposed (see [11]–[13] for a review). Alternative numerical observers are, for example, proposed in [14]–[17] based on human linear template estimation; in [18] and [19] efforts are made to incorporate nonlinearity into models. In the methods proposed in this work we retain the band-pass model of the visual system by using a channeling operator, but we aim to learn the discriminant function on the basis of the human observer's responses to images, rather than the ground truth. This concept was employed in [16], but with restriction to a linear discriminant. In [9], using a channelized support vector machine (CSVM), our group explored this idea without restricting to a linear discriminant. We showed that the CSVM, a non-linear kernel regression model, can achieve significantly better performance than the CHO. The cornerstone of the support vector machine (SVM) is statistical learning theory, which utilizes the structural risk minimization (SRM) principle [20] to achieve good generalization when applied to data beyond those available for training [9]. When assessing the quality of medical images, generalization performance is of utmost importance: a NO tuned to one type of images must be capable of yielding accurate predictions for a different type of images (for example, images obtained via a different tomographic reconstruction algorithm).

In the language of machine learning, the NO model must be capable of good generalization performance [11], an issue that has been somewhat neglected in the CHO literature, where the CHO is typically tested using the same images, or images of the same type (but different noise realizations), as those used for internal noise model tuning. In this work we revisit the kernel regression approach proposed previously [9]. In that work, excellent performance was obtained by using a CSVM, but the computation time needed for model parameter selection was onerous (on the order of a day). To address this problem, in this work we propose the use of a relevance vector machine (RVM) [21] to replace the SVM. The RVM is a kernel-based learning model in the family of modern Bayesian algorithms with lower model selection complexity, better generalization performance, lower computational cost, and the ability to use unconditioned kernel functions [21].

II. METHODOLOGY

In a human observer study for lesion detection with lesion location known exactly (LKE), the human observer evaluates each image and assigns a score, t, that reflects the observer's confidence that a lesion is present in the image. This is repeated for a number of images, followed by an evaluation of the observer's performance using the ROC methodology [5], which calculates the area under the ROC curve (AUC) as the final performance metric. In the proposed machine learning approach, we model the human observer's score t for a given image by a regression function of a feature vector v extracted from the image, where the regression function is parameterized by θ. This can be summarized by the following regression model:

(1) t = f(v; θ) + e

where f denotes the regression model and e is the modeling error (see Fig. 1). Initially, this function is unknown; the model parameters θ are estimated from data collected in a human observer study. This regression function represents the potentially complex relationship between the images and the human observer's responses to them.

A. Feature Extraction

In this work we used the same image features as those used in the traditional CHO methodology, namely the outputs of four constant-Q frequency-band filters presented in the work of Myers and Barrett [7], who argued that the human visual system can be modeled as a series of non-overlapping, rotationally symmetric band-pass (BP) filters. Examples of such filters, with cutoff frequencies of 0.5, 0.25, 0.125, 0.0625 and 0.0313 cycles/pixel, are shown in Fig. 2. Image features are extracted by calculating the signal energy passed through each of these four filters, resulting in a four-dimensional feature vector v describing each image. Note that while other channels may be considered (such as Gabor filters, overlapping difference-of-Gaussians (DOG), sparse DOG and Laguerre-Gaussian (LG) filters [1], [23]–[25]), in this work we focused on the traditional CHO channels.
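As an illustrative sketch (not the authors' code), the four band energies can be computed with radial masks in the Fourier domain; the exact filter profiles of [7] may differ in detail from these sharp-cutoff masks:

```python
import numpy as np

def channel_features(image, cutoffs=(0.5, 0.25, 0.125, 0.0625, 0.03125)):
    """Extract CHO-style features: the signal energy passed by four
    non-overlapping, rotationally symmetric band-pass filters.
    Band m passes radial frequencies in [cutoffs[m+1], cutoffs[m])."""
    n = image.shape[0]                       # assumes a square image
    f = np.fft.fftfreq(n)                    # frequencies in cycles/pixel
    fx, fy = np.meshgrid(f, f)
    rho = np.sqrt(fx**2 + fy**2)             # radial frequency
    spectrum = np.fft.fft2(image)
    feats = []
    for hi, lo in zip(cutoffs[:-1], cutoffs[1:]):
        mask = (rho >= lo) & (rho < hi)
        # by Parseval's theorem, band energy = sum |F|^2 over band / n^2
        feats.append(np.sum(np.abs(spectrum[mask])**2) / n**2)
    return np.array(feats)                   # 4-dimensional feature vector v
```

Each image is thus reduced to a 4-vector of band energies before any regression is applied.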
We anticipate that the improvement, if any, yielded by a different selection of channels would benefit all the considered learning methods in a comparable fashion.

B. Human Observer Study

In this work, as a test example, we used a previously published human observer study from a simulated SPECT myocardial-perfusion acquisition, which utilized the OSEM iterative image reconstruction algorithm [27] with various post-reconstruction filtering levels (specific details are provided later). In the human observer study, the observer's task was to determine the presence of a myocardial perfusion defect based on the


intensity change in the image. In the dataset considered, two observers were asked to score their confidence in the presence of a defect on a discrete scale from 1 to 6 (1 denotes that the lesion is definitively not present, and 6 indicates that the lesion is definitively present) in an LKE environment. This procedure was repeated over 100 images corresponding to different noise realizations, for each combination of six filtering levels and two different numbers of OSEM iterations, so that we have a total of 1200 image-score and feature pairs {t_i, v_i}. In the subsequent analysis, these data were carefully split into mutually exclusive training and test sets.

C. Channelized Support Vector Machine

The first supervised learning regression model considered is the channelized support vector machine (CSVM), which we previously evaluated in [9]. In this kernel regression model, the data is first mapped from the original feature space into a higher-dimensional feature space using the following, for now unknown, transformation:

Fig. 3. Several radial basis function (RBF) kernels of different width.

(2) φ: v ↦ φ(v)

The SVM methodology then uses a linear regression in the augmented feature space, represented as:

(3) t̂ = wᵀφ(v) + b

where t̂ is the predicted image score. The optimum values for w and b are determined through a structural risk minimization (SRM) procedure from [11], [31]:

(4) min over (w, b) of (1/2)‖w‖² + C Σ_{i=1}^{N} L_ε(t_i − t̂_i)

where the ε-insensitive loss function is defined as:

(5) L_ε(e) = 0 if |e| ≤ ε; L_ε(e) = |e| − ε otherwise

This soft-margin loss function [30], [31] does not penalize prediction errors smaller than ε. The regression in (4) is therefore also called ε-insensitive regression. The constant C determines the trade-off between model complexity (controlled by the magnitude and number of non-zero elements of w) and fidelity (measured by L_ε). Note that minimization of the norm of w is equivalent to minimization of the capacity of the regression function in (3) [20]. The values of C and ε are determined using a cross-validation procedure described in Section II-D. It can be shown that the optimal w has the following form:

(6) w = Σ_{i=1}^{N_SV} α_i φ(v_i)

which is defined by so-called support vectors, a subset of the training samples, where N_SV is the number of support vectors and the α_i are obtained by a quadratic programming optimization procedure [9]. This gives the following expression for the CSVM regression model:

(7) t̂(v) = Σ_{i=1}^{N_SV} α_i φ(v_i)ᵀφ(v) + b

Here we note that there is no need to know φ explicitly, but only the inner product φ(v_i)ᵀφ(v), which is usually more conveniently described by a kernel function. In this work (as in [9]) we use radial basis function (RBF) kernels described as:

(8) K(v_i, v) = exp(−‖v_i − v‖² / (2σ²))

where σ denotes the Gaussian kernel's width, also referred to as the kernel parameter in this manuscript (see Fig. 3 for example kernels). Alternatively, the SVM can use polynomial as well as sigmoid kernel functions [28], all of which should satisfy Mercer's condition [21], [36], i.e., the kernel must correspond to an inner product K(u, v) = φ(u)ᵀφ(v) in some feature space. The final expression for the CSVM regression model, from (7), can be rewritten as:

(9) t̂(v) = Σ_{i=1}^{N_SV} α_i K(v_i, v) + b
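For illustration only (the authors' experiments used a MOSEK-based quadratic programming implementation, not this one), an ε-insensitive RBF regression of the form in (3)-(9) can be sketched with scikit-learn; the feature/score data below are synthetic stand-ins for the channel features and HumO scores:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # stand-in 4-D channel feature vectors
t = 3.5 + X @ np.array([0.8, -0.5, 0.3, 0.1]) + 0.1 * rng.normal(size=100)

# sklearn's gamma relates to the kernel width sigma in (8) via
# gamma = 1 / (2 * sigma**2)
sigma, C, eps = 1.0, 10.0, 0.1
csvm = SVR(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=C, epsilon=eps)
csvm.fit(X, t)
t_hat = csvm.predict(X)                # predicted image scores, as in (9)
n_sv = len(csvm.support_)              # number of support vectors N_SV
```

Only the support vectors (training samples with non-zero α_i) contribute to the final regression model.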

D. K-Fold Cross-Validation (CV)

Training of the CSVM requires selection of the model parameters: the kernel width σ, the fidelity penalization C, and the insensitivity tube ε. To find optimal values for these parameters we use K-fold cross-validation (in our experiments, 6-fold) on the training dataset. In this method, the training data set is split into K folds; K − 1 folds are used for model optimization (i.e., finding


w and b) and the remaining fold is used for testing. This is repeated for every fold, and the performance metric is averaged over the K folds. As the final performance metric we used the averaged mean squared prediction error, defined as:

(10) MSE_CV = (1/K) Σ_{k=1}^{K} (1/N_k) Σ_{i∈fold k} (t_i − t̂_i)²

where N_k denotes the number of elements in the kth fold. This is repeated for every candidate value of the parameters σ, C and ε. It is worth mentioning that the range of values used in cross-validation has to be wide, so that a good global optimum can be found (a coarse-to-fine approach can be used to first locate a range of values in which the optimal parameter is likely to lie).

E. Channelized Relevance Vector Machine

The RVM is a modern Bayesian learning approach [29], which has several benefits over the SVM: a probabilistic interpretation, use of dramatically fewer basis functions, lower model selection complexity, and the ability to use arbitrary non-'Mercer' kernels [21]. In a Bayesian approach, we ideally desire a predictive distribution of the score t:

(11) p(t | D) = ∫ p(t | θ) p(θ | D) dθ

marginalized over all nuisance model parameters, here denoted by θ, where D denotes the observed training data. The nuisance model parameters define the regression model but are not of particular interest by themselves. Equation (11) cannot be expressed in closed form, so we must seek an effective approximation, as described next [21]. For clarity, we omit the implicit conditioning upon the set of input feature vectors in (11) and subsequent expressions.

1) Model Specification: In the RVM approach, the relationship between features and scores is defined by:

(12) t̂(v) = Σ_{i=1}^{N} w_i K(v, v_i) + w_0

where K(·,·) is the kernel function (e.g., a radial basis function (RBF) with kernel width σ) that maps the feature space to a non-linear space in which linear regression is applied. Note that the RVM and SVM have conceptually the same forms, given in (9) and (12). The RVM methodology [21] introduces a hierarchical prior model that can be summarized as:

(13) t_i ~ N(t̂(v_i; w), β⁻¹), w_i ~ N(0, α_i⁻¹), α_i ~ Gamma(a, b), β ~ Gamma(c, d)

Here ~ denotes that the variable is distributed accordingly, N denotes a Gaussian, and Gamma is a Gamma probability density function. To make the Gamma hyper-priors non-informative (i.e., flat), we fix their parameters to small values. As we can see from (13), the RVM uses a hierarchical prior probability distribution on the weights that encourages sparsity (producing mostly zero weights). To see this, we need to calculate the true prior over the weights, obtained by marginalization over α. This marginalization results in a Student-t distribution whose density has contours that induce sparsity (see [21] for details).

2) Inference: Having adopted a Bayesian approach, we desire the predictive distribution defined in (11), marginalized over all nuisance model parameters {w, α, β}. This is usually not possible analytically, and therefore we must seek an effective approximation. In this work we use the fact that the posterior of (α, β) can be approximated by a delta function at its mode, i.e., at its most-probable values (α_MP, β_MP). We can take this point estimate as representative of the posterior; that is, for predictive purposes we just require

(14) p(α, β | D) ≈ δ(α_MP, β_MP)

to be a good approximation [21]. This results in a predictive distribution given by:

(15) p(t | v, D) = N(t; μᵀφ(v), σ²(v)), with σ²(v) = β_MP⁻¹ + φ(v)ᵀ Σ φ(v)

where μ and Σ are, respectively, the mean and covariance matrix of the posterior over the weights w, and are iteratively estimated as summarized in Table I. A complete description of the RVM methodology can be found in [21]. Here we point out several advantages of the RVM over the SVM:

1. The RVM's Bayesian formulation reduces the number of free parameters compared to the SVM, whose optimization requires a time-consuming cross-validation step. In the RVM we are left only with the choice of the kernel width σ, which is embedded in the kernel function K(·,·). This parameter can be found by cross-validation.

2. The probabilistic view allows computing marginal likelihoods, which is very useful for model parameter selection and comparison. For example, we can opt to use the marginal likelihood to find the kernel width σ instead of cross-validation.

3. The RVM is a probabilistic regression, so for each regression point one can also characterize the uncertainty in the prediction, that is, σ²(v).

4. Finally, the RVM is a special case of a Gaussian process (GP) [29] in which the spatial correlation (see (15)) is defined by the kernel K. From the GP viewpoint, it is evident that the proposed RVM regression model finds a solution belonging to the reproducing kernel Hilbert space (RKHS) defined by the kernel function K.
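A minimal sketch of the iterative re-estimation summarized in Table I, following Tipping's standard update rules; the initialization constants, pruning threshold, and numerical guards here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def rvm_fit(X, t, sigma=1.0, n_iter=100, prune=1e9):
    """Sparse Bayesian (RVM) regression: iteratively re-estimate the
    weight-prior precisions alpha and noise precision beta of (12)-(13)."""
    N = len(t)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # design matrix: bias column plus one RBF kernel column per sample
    Phi = np.hstack([np.ones((N, 1)), np.exp(-d2 / (2 * sigma**2))])
    keep = np.arange(Phi.shape[1])        # indices of surviving basis funcs
    alpha = np.ones(Phi.shape[1])
    beta = 1.0 / max(np.var(t), 1e-12)
    for _ in range(n_iter):
        P = Phi[:, keep]
        Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[keep]))
        mu = beta * Sigma @ P.T @ t       # posterior mean of the weights
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)   # "well-determinedness"
        alpha[keep] = np.maximum(gamma, 1e-12) / np.maximum(mu**2, 1e-12)
        beta = max(N - gamma.sum(), 1e-6) / max(np.sum((t - P @ mu)**2), 1e-12)
        keep = keep[alpha[keep] < prune]  # prune irrelevant basis functions
    P = Phi[:, keep]
    Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[keep]))
    mu = beta * Sigma @ P.T @ t
    return keep, mu, beta                 # relevance vectors and weights
```

Prediction then evaluates (12) over only the surviving kernels; the pruned columns correspond to weights driven to zero by the Student-t prior.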


TABLE I
ITERATIVE STEPS OF THE RVM ALGORITHM

5. Finally, the RKHS interpretation motivated the inclusion of a multi-kernel model, to allow the final solution to belong to an RKHS defined by multiple kernel functions.

F. Robustness of SVM and RVM

While the SVM relies on ε-insensitivity to relax its model and produce many zero-valued elements in w, the RVM approach does not include such an explicit control. Instead, the RVM avoids sensitivity to small errors through its pruning criterion, induced by the Student-t distribution over w, which enforces sparsity over the elements of w.

G. Multi-Kernel Channelized Relevance Vector Machine

The second new regression model proposed in this paper is the multi-kernel channelized relevance vector machine (MKCRVM). Considering the RVM interpretation as a Gaussian process [29], it quickly becomes evident that both the CRVM and CSVM regression models represent the relation between image features and HumO scores in a stationary RKHS defined by a single kernel. There is no evidence that the human observer regression model should obey this assumption. Thus, by including several kernel functions, one can expect improved regression accuracy. Mathematically, the multi-kernel CRVM is easily obtained by expanding the design matrix as:

(22) Φ = [Φ_1, Φ_2, …, Φ_M]

where Φ_m is a "sub-design" matrix with σ_m as its kernel width and M denotes the number of different kernels (see Fig. 3). We note that one could also consider a mixture of different kernel types (linear, polynomial, etc.), although a simpler case is presented in this work. Here we also have to augment w and α as:

(23) w = [w_1ᵀ, w_2ᵀ, …, w_Mᵀ]ᵀ
(24) α = [α_1ᵀ, α_2ᵀ, …, α_Mᵀ]ᵀ

The rest of the equations for the MKCRVM are exactly the same as for the single-kernel CRVM and as such are omitted. The MKCRVM finds optimal weights for every kernel under consideration; therefore, it does not need cross-validation for model parameter selection: unused kernel widths simply receive zero weights.

H. Channelized Hotelling Observer

For completeness of this manuscript we give a brief description of the CHO, the most widely used numerical observer serving as a surrogate for human observers in evaluations of image quality. The CHO is a cascade of two linear operators, which in practice can be combined into one. The first operator, the channel operator, measures features of the image by applying, for example, non-overlapping band-pass (BP) filters as in Fig. 2, or other commonly used channels such as Gabor filters, overlapping difference-of-Gaussians (DOG), sparse DOG and Laguerre-Gaussian (LG) filters [1], [23]–[25]. The second operator, called the Hotelling observer, computes a test statistic, t_CHO, for choosing between the defect-absent and defect-present hypotheses based on the observed feature vector v. This test statistic is defined as:

(25) t_CHO = w_CHOᵀ v

where

(26) w_CHO = [(S_0 + S_1)/2 + S_int]⁻¹ (v̄_1 − v̄_0)

in which v̄_h is the expected value of v under hypothesis h, S_h is the covariance matrix of v, and S_int models an internal noise. In this work we consider three types of internal noise models, which are applied to the diagonal elements of the covariance matrix (see [11]–[14] for their definitions). When properly chosen, the internal noise parameter can improve the agreement between the CHO and the human observer detection performance. The CHO's detection performance is usually summarized as [1]:

(27) AUC = 1/2 + (1/2) erf(d_a / 2), with d_a² = (v̄_1 − v̄_0)ᵀ w_CHO

For more details readers can refer to [1], [11]–[14]. Note that the CHO can predict the ROC and/or the AUC (the AUC is given in (27)), but cannot directly predict the image scores given by human observers.
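A compact sketch of (25)-(27) from sample feature vectors; the internal noise term here is a simple diagonal inflation, standing in for the paper's specific noise models:

```python
import numpy as np
from math import erf, sqrt

def cho_auc(V0, V1, s_int=0.0):
    """Hotelling template (25)-(26) and the AUC summary (27).
    V0, V1: (n_samples, n_features) feature arrays for the defect-absent
    and defect-present classes. s_int: variance added to the diagonal,
    a simplified stand-in for the internal noise models."""
    dv = V1.mean(axis=0) - V0.mean(axis=0)
    S = 0.5 * (np.cov(V0.T) + np.cov(V1.T)) + s_int * np.eye(V0.shape[1])
    w = np.linalg.solve(S, dv)            # Hotelling template w_CHO
    d_a = sqrt(dv @ w)                    # detectability index
    return 0.5 + 0.5 * erf(d_a / 2)       # AUC under Gaussian statistics
```

Increasing s_int lowers d_a and thus the predicted AUC, which is how the internal noise brings the CHO's performance closer to that of the human observer.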


Fig. 4. Example images of myocardial perfusion SPECT imaging with defects, human observer (HumO) scores are shown near the image: 1-definitely no defect; 6-defect definitely present.

Fig. 5. Example images of myocardial perfusion SPECT imaging without a defect, human observer (HumO) scores shown near the images: 1-definitely no defect; 6-defect definitely present.

III. EXPERIMENTAL RESULTS

A. Human Observer Study

As a case example for NO development, we used data from a previously reported study on myocardial-perfusion imaging with the SPECT imaging modality and reconstruction via an OSEM iterative image reconstruction algorithm [27]. The images were simulated using the Mathematical Cardiac Torso (MCAT) [32] phantom, which generates average activity and attenuation maps accounting for respiratory motion, the wringing motion of the beating heart, and heart-chamber contraction. The phantom images were generated on a 128 × 128 × 128 grid with a voxel size of 0.317 cm, with a simulated perfusion defect in the left ventricular wall. Projections were obtained using a Monte Carlo method (SIMIND [33]) including the effects of non-uniform attenuation, photon scatter and distance-dependent resolution, in order to simulate a low-energy high-resolution collimator. The noise level corresponded to an average of 3.8 million counts, chosen to be broadly representative of clinical standards [27]. An ordered-subsets expectation-maximization (OSEM) reconstruction algorithm with both one and five effective iterations was used to reconstruct the myocardial perfusion images. Next, the reconstructed images were low-pass filtered by 3-D Gaussian filters with full width at half maximum (FWHM) values of 0, 1, 2, 3, 4 and 5 pixels. These procedures led to 12 different sets of reconstructed images, each corresponding to one FWHM and one number of effective iterations. Figs. 4 and 5 show several example images from these studies with the associated scores. Throughout the human observer study, the observer's task was to determine the presence of the myocardial defect based on the intensity change in the image. In the dataset we used, two observers were asked to score the defect-presence likelihood on a discrete 1-to-6 scale (1: definitely no; 6: definitely yes) in a location-known-exactly (LKE) environment. This procedure was repeated over 100 images (corresponding to 100 different noise realizations) for each specific combination of FWHM and effective iteration number. The difference in performance between the two observers was negligible (the AUC measures varied by less than 4% between observers; see [9] for a detailed analysis). Therefore, the image score that we use is the average of the two human observers' scores. Accordingly, the reader variability is associated with the variability of both readers (we can interpret it as two observers jointly scoring the images). In the future, we may revisit this issue by training two numerical observers, one for each human observer. The HumO study is followed by HumO performance evaluation by calculating the AUC [5].
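As a simple illustration of how an AUC can be obtained from such confidence scores, the nonparametric (Mann-Whitney) estimate below is a stand-in for the full ROC-fitting methodology of [5]:

```python
import numpy as np

def auc_from_scores(scores_absent, scores_present):
    """Nonparametric AUC (Mann-Whitney / Wilcoxon statistic) from
    confidence scores for defect-absent and defect-present images."""
    s0 = np.asarray(scores_absent, float)
    s1 = np.asarray(scores_present, float)
    # fraction of (present, absent) pairs ranked correctly; ties count 1/2
    gt = (s1[:, None] > s0[None, :]).mean()
    eq = (s1[:, None] == s0[None, :]).mean()
    return gt + 0.5 * eq

# e.g., 1-6 confidence ratings for defect-absent vs. defect-present images
print(auc_from_scores([1, 2, 2, 3], [4, 5, 6, 6]))   # perfect separation -> 1.0
```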

B. Numerical Observers Training-Testing

In this work, we study a type of generalization that is perhaps the most representative of the practical use of a NO. All NOs were trained using a broad set of images, and then tested on a different, but equally broad, set of images. Specifically, we trained the NOs using images for one iteration of OSEM and every value of the filter FWHM (600 images for training), and then tested on images obtained with five iterations of OSEM and every value of the filter FWHM (600 images for testing). The experiment was then repeated with the roles of the one- and five-iteration images reversed. In each scenario, the CHO, CSVM, CRVM and MKCRVM were optimized to minimize the generalization score prediction error (10) (or the AUC error, in the case of the CHO) measured using 6-fold cross-validation based only on the corresponding training data. For SVM parameter optimization we used the MOSEK optimization toolbox (MOSEK [34]), which implements fast quadratic programming. For both RVM methods we used the SPIDER machine-learning environment [35], with minor modifications for the multi-kernel RVM. Each model parameter was optimized using 6-fold cross-validation (CV) as previously described. For the CSVM we searched over a range of values of σ, C and ε, whereas for the CRVM only σ required tuning. In the multi-kernel CRVM there was no need for cross-validation. The MKCRVM used three kernels from a


TABLE II OPTIMAL MODEL PARAMETERS OVER TRAINING DATA


TABLE IV EXAMPLE IMAGES WITH: HUMAN, CSVM, CRVM AND MKCRVM SCORES

TABLE III NUMERICAL OBSERVERS EXECUTION TIME AND NUMBER OF SVS AND RVS

set of σ values used in the previous methods. Optimal model parameters are summarized in Table II. Table II shows that, even though the same kernel functions are used, the feature space mapping differs between the CSVM and the CRVM, since their associated σ values are different. Next, in Table III, we compare computation time as well as the number of support or relevance vectors. These results clearly show that the computational burden of CSVM model parameter selection is far greater than that of the CRVM and the MKCRVM, owing to the time-consuming cross-validation process. In the same table, one can observe that both the CRVM and MKCRVM use fewer relevance vectors than the CSVM uses support vectors. Such behavior is usually desirable [20], especially if, as we will see later in Table V, the model accuracy is not compromised. In Table IV one can examine the relationship between the predicted scores and the scores given by a human observer on an image-by-image basis, for a set of randomly selected images. One can see that the MKCRVM consistently produces better predictions. Next, in Fig. 6, we show the mean-squared error (MSE) between NO-predicted scores and HumO scores over the testing data sets, as a function of post-reconstruction filter FWHM. These numbers are summarized in Table V as an average MSE and MSE standard deviation. They indicate that the MKCRVM is the most accurate regression model, followed by the CRVM. Here, once again, we stress that the MKCRVM does not need K-fold cross-validation to optimize the kernel width or any other model parameter. Next, in Fig. 7, we show the observers' performance as summarized by the ROC methodology and AUC values for each reconstruction method [5]. The error bars represent one standard deviation of the HumO AUC estimates. One can observe that the MKCRVM and CSVM are able to closely follow the human AUC curve.
This is also confirmed by the average absolute AUC error reported in Table V. From

TABLE V COMPREHENSIVE COMPARISON OF CSVM, CRVM AND MKCRVM

this table one can observe that the MSE does not always correlate with the mean absolute AUC error. This is because the MSE measures score-prediction accuracy and does not directly reflect the separability between the two classes, which is what the AUC measures. The AUC is insensitive to outliers at the ends of the scoring scale, or to shifts in score values, whereas the MSE is not. Next, in Table VI, we provide the p-values for testing the null hypothesis that there is no difference in AUC values between the human and numerical observers. These values are calculated using ROCKIT [37]. One can see that the CRVM method has multiple points where the p-value is lower than 0.1 (shaded in gray), which means that it fails to capture HumO performance for several FWHM values. On the other hand, the CSVM and MKCRVM perform similarly: the p-value is lower than 0.1 at two points for each method.


Fig. 6. Comparison of prediction performance of CSVM, CRVM and MKCRVM in terms of generalization mean-squared error; (a) training on iter 1 and testing on iter 5 data; (b) training on iter 5 and testing on iter 1 data.

Fig. 7. Comparison of performances of human observer, CSVM, CRVM and MKCRVM in terms of area under ROC (AUC); (a) training on iter 1 and testing on iter 5 data; (b) training on iter 5 and testing on iter 1 data. Error bars represent one standard deviation of human observer AUC estimates.

To complete the comparisons, in Table VII we present the mean absolute AUC error (the same error measure as in Table V) for the CHO with different internal noise models [14]. This experiment shows that the CHO, for the tested data set, does not perform as well as the other regression models. In addition, the CHO appears insensitive to the assumed type of internal noise model. Last, we point out that the CHO does not predict HumO scores but can only be used to predict performance metrics such as the ROC or AUC. In separate preliminary studies [11]–[13] we also implemented the CHO with Gabor, DOG, SDOG and LG channels as well as eleven internal noise models, and we noted that the accuracy of every combination was lower than that of the proposed CRVM and MKCRVM. Analysis of all possible combinations is out of the scope of this work and is under consideration for a separate publication.

TABLE VI P-VALUES OF HYPOTHESIS TESTING OF THE NULL HYPOTHESIS THAT THERE IS NO DIFFERENCE, IN AUC VALUES, BETWEEN HUMAN AND NUMERICAL OBSERVERS

IV. DISCUSSION

It is noted that in this study the two human observers gave similar ratings, and their scores were consequently averaged. In a more general setting, especially with more than two observers, there can be greater inter-observer variability. In principle, the proposed methodology can still be extended to model individual observers, e.g., by training a numerical observer


TABLE VII MEAN AUC ERROR OF CHO FOR DIFFERENT INTERNAL NOISE MODELS


for each human observer, in order to accommodate the larger inter-observer variability. This is an interesting direction for future research (see [38] for a preliminary investigation). In addition, the two sets of images, iter 1 and iter 5, used in this study were reconstructed by OSEM with various post-reconstruction filters, but from the same phantom. Thus, the generalization task may be relatively easy in this SKE/BKE setting, since the only variability came from the Poisson noise in the measured sinograms. The proposed methodology does not incorporate the imaging noise or object model directly, and is therefore expected to remain applicable when additional object and background variability is included.

V. CONCLUSION

In this paper, we proposed and evaluated two new Bayesian learning regression models as numerical observers for medical image quality assessment: the channelized relevance vector machine (CRVM) and the multi-kernel channelized relevance vector machine (MKCRVM). Both aim to mimic human observer performance in a cardiac perfusion-defect detection task in single-photon emission computed tomography (SPECT) images. The technique inherits the design simplicity and low computational cost of relevance vector machines, a modern class of Bayesian learning methods, which allows a reduction in execution time. In experiments we compared the proposed methods with the channelized Hotelling observer (CHO) and the previously proposed channelized support vector machine (CSVM). The presented results show that the MKCRVM method achieves the best accuracy as measured by mean squared score-prediction error. At the same time, the MKCRVM has the smallest computational complexity, since it avoids the time-consuming model-selection step. If aggregate performance is the principal concern, as measured by the fidelity of the area under the receiver operating characteristic (ROC) curve (AUC), then the newly proposed MKCRVM considerably outperforms the CRVM and CSVM methods, with the CSVM second. All learning methods under consideration outperform the classical CHO. Finally, based on statistical comparison between numerical and human observers, the CSVM and MKCRVM perform equally well, while the MKCRVM requires only a fraction of the computation time of the CSVM.

ACKNOWLEDGMENT

The authors thank the medical physics group at the University of Massachusetts Medical Center, led by M. A. King, for generously providing the human-observer data used in the experiments.

REFERENCES

[1] H. H. Barrett and K. J. Myers, Foundations of Image Science. New York: Wiley, 2003.
[2] L. B. Lusted, "Signal detectability and medical decision making," Science, vol. 171, pp. 1217–1219, 1971.
[3] W. W. Peterson, T. G. Birdsall, and W. C. Fox, "The theory of signal detectability," Trans. IRE Professional Group on Information Theory, pp. 171–212, 1954.
[4] J. A. Swets and R. M. Pickett, Evaluation of Diagnostic Systems: Methods From Signal Detection Theory. New York: Academic Press, 1982.
[5] C. E. Metz, "Basic principles of ROC analysis," Seminars Nucl. Med., vol. 8, pp. 283–298, 1978.
[6] International Commission on Radiation Units and Measurements, ICRU Report 54, 1996.
[7] K. J. Myers and H. H. Barrett, "Addition of a channel mechanism to the ideal-observer model," J. Opt. Soc. Amer. A, vol. 4, no. 12, pp. 2447–2457, Dec. 1987.
[8] J. Yao and H. H. Barrett, "Predicting human performance by a channelized Hotelling model," in Proc. SPIE, Dec. 1992, vol. 1786, pp. 161–168.
[9] J. G. Brankov, Y. Yang, L. Wei, I. El Naqa, and M. N. Wernick, "Learning a channelized observer for image quality assessment," IEEE Trans. Med. Imaging, vol. 28, no. 7, pp. 991–999, Jul. 2009.
[10] C. K. Abbey and H. H. Barrett, "Human- and model-observer performance in ramp-spectrum noise: Effects of regularization and object variability," J. Opt. Soc. Amer. A, vol. 18, no. 3, pp. 473–488, Mar. 2001.
[11] J. G. Brankov, "Optimization of the internal noise models for channelized Hotelling observer," in Proc. IEEE Int. Symp. Biomedical Imaging: From Nano to Macro, Mar. 30–Apr. 2, 2011, pp. 1788–1791.
[12] J. G. Brankov, "Comparison of the internal noise models for channelized Hotelling observer," in Proc. IEEE NSS/MIC, Oct. 23–29, 2011, pp. 4369–4372.
[13] J. G. Brankov, "Optimization of the channelized Hotelling observer internal-noise model in a train-testing paradigm," presented at the Med. Image Percep. Conf. XIV, Dublin, Ireland, 2011.
[14] Y. Zhang, B. T. Pham, and M. P. Eckstein, "Evaluation of internal noise methods for Hotelling observer models," Med. Phys., vol. 34, no. 8, pp. 3313–3321, 2007.
[15] Y. Zhang, B. T. Pham, and M. P. Eckstein, "Task-based model/human observer evaluation of SPIHT wavelet compression with human visual system-based quantization," Acad. Radiol., vol. 12, no. 3, pp. 324–336, 2005.
[16] C. K. Abbey and M. P. Eckstein, "Optimal shifted estimates of human-observer templates in two-alternative forced-choice experiments," IEEE Trans. Med. Imaging, vol. 21, no. 5, pp. 429–440, 2002.
[17] C. Castella, C. K. Abbey, M. P. Eckstein, F. R. Verdun, K. Kinkel, and F. O. Bochud, "Human linear template with mammographic backgrounds estimated with a genetic algorithm," J. Opt. Soc. Amer. A, vol. 24, no. 12, pp. B1–B12, 2007.
[18] Y. Zhang, B. T. Pham, and M. P. Eckstein, "The effect of nonlinear human visual system components on performance of a channelized Hotelling observer in structured backgrounds," IEEE Trans. Med. Imaging, vol. 25, no. 10, pp. 1348–1362, 2006.
[19] M. P. Eckstein, A. J. Ahumada, Jr., and A. B. Watson, "Visual signal detection in structured backgrounds. II. Effects of contrast gain control, background variations, and white noise," J. Opt. Soc. Amer. A, vol. 14, no. 9, pp. 2406–2419, 1997.
[20] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[21] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Machine Learning Research, vol. 1, pp. 211–244, Jun. 2001.
[22] M. M. Kalayeh, T. Marin, P. H. Pretorius, M. N. Wernick, Y. Yang, and J. G. Brankov, "Channelized relevance vector machine as a numerical observer for cardiac perfusion defect detection task," in Proc. SPIE Medical Imaging, vol. 7966, 2011.
[23] B. D. Gallas and H. H. Barrett, "Validating the use of channels to estimate the ideal linear observer," J. Opt. Soc. Amer. A, vol. 20, pp. 1725–1738, 2003.


[24] M. A. Kupinski, E. Clarkson, J. W. Hoppin, L. Chen, and H. H. Barrett, "Experimental determination of object statistics from noisy images," J. Opt. Soc. Amer. A, vol. 20, pp. 421–429, 2003.
[25] S. Park, H. H. Barrett, E. Clarkson, M. A. Kupinski, and K. J. Myers, "Channelized-ideal observer using Laguerre-Gauss channels in detection tasks involving non-Gaussian distributed lumpy backgrounds and a Gaussian signal," J. Opt. Soc. Amer. A, vol. 24, no. 12, pp. B136–B150, 2007.
[26] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.
[27] M. V. Narayanan, H. C. Gifford, M. A. King, P. H. Pretorius, T. H. Farncombe, P. Bruyant, and M. N. Wernick, "Optimization of iterative reconstructions of 99mTc cardiac SPECT studies using numerical observers," IEEE Trans. Nucl. Sci., vol. 49, no. 5, pp. 2355–2360, Oct. 2002.
[28] D. J. C. MacKay, "Bayesian interpolation," Neural Computation, vol. 4, no. 3, pp. 415–447, May 1992.
[29] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
[30] K. P. Bennett and O. L. Mangasarian, "Robust linear programming discrimination of two linearly inseparable sets," Optim. Meth. Software, vol. 1, pp. 23–34, 1992.


[31] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, 1995.
[32] P. H. Pretorius, M. A. King, B. M. W. Tsui, K. J. LaCroix, and W. Xia, "A mathematical model of motion of the heart for use in generating source and attenuation maps for simulating emission imaging," Med. Phys., vol. 26, pp. 2323–2332, 1999.
[33] M. Ljungberg and S.-E. Strand, "A Monte Carlo program for the simulation of scintillation camera characteristics," Comput. Methods Programs Biomed., vol. 29, no. 4, pp. 257–272, Aug. 1989.
[34] MOSEK ApS, Denmark [Online]. Available: http://www.mosek.com
[35] SPIDER, Max Planck Institute for Biological Cybernetics [Online]. Available: http://www.kyb.Tuebingen.mpg.de/bs/people/spider/
[36] R. Courant and D. Hilbert, Methods of Mathematical Physics. Hoboken: Wiley-Interscience, 1953.
[37] ROCKIT [Online]. Available: http://metz-roc.uchicago.edu/MetzROC/software/software
[38] J. G. Brankov and P. H. Pretorius, "Personalized numerical observer," in Proc. SPIE, 2010, vol. 7627.
