Computerized Medical Imaging and Graphics, Vol. 15, No. 4, pp. 217-223, 1991. Printed in the USA. 0895-6111/91 $3.00 + .00. Copyright © 1991 Pergamon Press plc.

VALIDATION OF MAGNETIC RESONANCE IMAGING (MRI) MULTISPECTRAL TISSUE CLASSIFICATION*

Michael W. Vannier, Thomas K. Pilgram, Christopher M. Speidel, Lynette R. Neumann, Douglas L. Rickman¹, and Larry D. Schertz

Mallinckrodt Institute of Radiology, Washington University School of Medicine, 510 S. Kingshighway Blvd., St. Louis, MO 63110; ¹NASA Earth Resources Laboratory, National Space Technology Laboratories, Bay St. Louis, MS 39529

(Received 14 December 1990)

Abstract—The application of NASA multispectral image processing technology to the analysis of Magnetic Resonance Imaging (MRI) scans has been studied. Software and hardware capability has been developed, and a statistical evaluation of multispectral analysis applied to MRI scans of the head has been performed.

Key Words: Magnetic resonance (MR), Image processing, Tissue characterization, Brain, MR studies, Brain neoplasms

*Presented at the 9th International Conference on Pattern Recognition, Rome, Italy, November 14-17, 1988.

INTRODUCTION

Magnetic Resonance Imaging (MRI) offers the potential for noninvasive acquisition of tissue-discriminant information about the body. The potential of MRI for improved disease detection and improved morphologic diagnosis has been realized in part. NASA has vast experience in remote sensing and image processing and has developed numerous systems and techniques for processing multispectral data such as that provided by Landsat satellites. Since MRI data consist of multiple channels of independent but geometrically registered, medically significant data, they are analogous to multispectral remote sensing data. The application of this NASA technology is, therefore, natural and powerful. Multispectral analysis of proton MR images (1-3) may provide the tissue-characteristic information encoded therein. Using well-established methods for computer processing of multispectral images, tissue characterization signatures are sought, using supervised or unsupervised classification methods. The radiometric and geometric distortions introduced by the MRI scanner can be removed before feature extraction and classification. Signatures may be derived from one set of multispectral MR images at a single anatomic level and be applied in the same subject at later times, to other anatomic levels containing the same tissues, or to other subjects. The principal advantages of multispectral analysis include:

1. It is a quantitative means of measuring the behavior of multidimensional imaging systems, such as MRI.
2. In other applications, multispectral methods have been useful in identifying subtleties that would otherwise be overlooked.
3. MR images are intrinsically multispectral. The data in a set of MR images are highly redundant, both geometrically and radiometrically (4); a sketch of this representation follows the list.
4. Multispectral methods are well developed, have been implemented on computers for which software that can process MR image data efficiently is readily available, and can adapt to existing MR scanners.
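As a concrete illustration of point 3, a set of co-registered MR images can be handled as one multispectral array in which every pixel is a spectral vector. The following is a minimal sketch in Python/NumPy; it is an assumed, illustrative representation, not the VAX/IIS implementation used in this work.

```python
import numpy as np

# Illustrative sketch only: three geometrically registered MR images of one
# anatomic level are stacked so that each pixel becomes a 3-band "spectral"
# vector, exactly as in multispectral remote sensing.

def as_multispectral(*bands):
    """Stack co-registered 2-D images into an (H, W, B) multispectral array."""
    arrays = [np.asarray(b, dtype=np.float64) for b in bands]
    if len({a.shape for a in arrays}) != 1:
        raise ValueError("all bands must share one geometry (be registered)")
    return np.stack(arrays, axis=-1)

# Synthetic 256 x 256 stand-ins for three acquired images.
rng = np.random.default_rng(0)
cube = as_multispectral(*(rng.random((256, 256)) for _ in range(3)))
print(cube.shape, cube[128, 128])   # (256, 256, 3) and one pixel's spectral vector
```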

MATERIALS AND METHODS

A study to evaluate the potential for multispectral MR analysis was performed. MR scans of the head were obtained with a 0.5 T clinical imager (Siemens Magnetom), yielding data sets of three images from two different acquisition sequences at each anatomic level studied. The two pulse sequences gave a total of three spin-echo images: (a) TR = 0.3 s, TE = 30 ms; (b) TR = 1.5 s, TE = 30 ms; and (c) TR = 1.5 s, TE = 120 ms. These data were entered into an image processing system consisting of a VAX 11/750 minicomputer host, an IIS Model 75 hardware image processor, and IIS System/600 software (International Imaging Systems, Inc., Milpitas, CA). Color composite images were generated after preprocessing of the MR data sets. Several of the color patterns obtained from this simple multiparameter display scheme were remarkably consistent and indicate the subjective utility of multispectral tissue classification analysis for head MR data.
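The color composite display mentioned above can be sketched as follows. This is an assumed NumPy illustration of mapping three registered bands into the red, green, and blue channels of a display; the study used the IIS hardware display, and the band-to-color assignment here is only an example.

```python
import numpy as np

def color_composite(red_band, green_band, blue_band):
    """Form an (H, W, 3) RGB composite from three co-registered MR images,
    scaling each band independently to the range [0, 1]."""
    def rescale(band):
        band = np.asarray(band, dtype=np.float64)
        lo, hi = band.min(), band.max()
        return (band - lo) / (hi - lo) if hi > lo else np.zeros_like(band)
    return np.stack([rescale(b) for b in (red_band, green_band, blue_band)], axis=-1)

# Example (hypothetical image arrays standing in for the three acquired slices):
# rgb = color_composite(image_a, image_b, image_c)
```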




Signal variations in the gray and white matter of MRI brain images are largely due to the presence of radiometric inhomogeneity. While in vivo white matter is relatively homogeneous in its paramagnetic properties, certainly within a few percent, we observed variations of 30 to 50% or more in the MR white matter signal intensity at different locations in the brain. These variations are due to radiometric inhomogeneities of the magnetic and radio frequency environment, which must be corrected before quantitative analysis. After correction of the MR images, signal strength variation in the tissue components of the brain was reduced to a few percent. Supervised and unsupervised classification algorithms (5) available in statistical image pattern recognition software (System/600 from International Imaging Systems, Milpitas, CA, and NASA Earth Resources Laboratory Applications Software, or ELAS (6)) provide a means for classification and discrimination of multiband images. The supervised method relies on interactive identification of regions of interest and classification of data points (pixel values) into user-defined classes based on one of a variety of classification algorithms. The unsupervised method for classification of multispectral data sets is accomplished without operator intervention (7). Accordingly, with the unsupervised method, the operators cannot employ their knowledge of the scene content, nor can they easily correct classification errors. The supervised classification algorithms we used included minimum distance from the mean, maximum likelihood, and parallelepiped classifiers (8). The unsupervised classifier we employed uses a minimum distance algorithm for cluster analysis. A cluster is a homogeneous group of pixels that are very similar to one another. Prior to executing the supervised pattern classification algorithm, a statistical analysis (e.g., population means and covariance matrices) of the spectral data is calculated. Examination of these statistical descriptors, either in tabular or histogram form, allows the proficient operator to edit the pixel values in the regions of interest to enhance the likelihood that each pixel value will correspond to only one training class. With experience, for example, the operator can recognize that a large standard deviation indicates that pixels of different land types (satellite data) or tissue types (magnetic resonance data) have been assigned to the same training class. This statistical data is then used to design a classifier using a suitable algorithm. For the unsupervised classification, the user specifies the number of classes to be defined. The computer calculates the population means and covariance matrix elements by performing a clustering analysis of the input data.
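A minimal sketch of the supervised workflow just described, under the assumption of a NumPy implementation (the study used System/600 and ELAS): per-class statistics are computed from operator-defined training regions, and each pixel is then assigned to the class whose mean vector is nearest.

```python
import numpy as np

def train_statistics(cube, training_masks):
    """Per-class mean vectors and covariance matrices from training regions.

    cube           : (H, W, B) multispectral MR image
    training_masks : dict of class name -> (H, W) boolean mask of training pixels
    """
    stats = {}
    for name, mask in training_masks.items():
        samples = cube[mask]                                   # (N, B) pixel vectors
        stats[name] = (samples.mean(axis=0), np.cov(samples, rowvar=False))
    return stats

def minimum_distance_classify(cube, stats):
    """Assign every pixel to the class whose mean is nearest (Euclidean distance)."""
    names = list(stats)
    means = np.stack([stats[n][0] for n in names])                    # (C, B)
    distances = np.linalg.norm(cube[..., None, :] - means, axis=-1)   # (H, W, C)
    return np.asarray(names, dtype=object)[np.argmin(distances, axis=-1)]
```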


The classifier iterates, splitting and merging classes according to predefined criteria. One difficulty with this approach is that the best criteria for splitting and merging classes, and the best choice for the number of classes that can be separated using spectral signatures, are sometimes hard to ascertain beforehand. Pixels of these and other multiband images are classified according to statistical pattern recognition of certain characteristics of each tissue component in a scene. Each supervised classification algorithm we tested used the initial statistical data from training regions to classify each pixel in a scene. These pixels are assigned either to one of the predefined tissue classes represented in the training set or to a reject class. The reject class is defined as containing those pixels that are too different from any of the training classes. The output of the classification sequence is a class map of the scene content on a pixel-by-pixel basis (9). To validate the data sets, an expert observer (LDS) manually traced the outlines of each anatomic structure as an overlay to the original scans. These scan data were manually segmented into anatomic regions on the IIS image processing system, using its roam and zoom capabilities and color graphic overlays. In this manner, the original head MRI scan images were transformed into manually segmented anatomic theme maps in which the encoded regions were uniquely represented by their class numbers on a pixel-by-pixel basis. This manual classification step was very time consuming, as might be anticipated, since each region was drawn on a pixel-by-pixel basis using cursor control step switches, a trackball, and an x-y tablet, with up to 8 times magnification of the original data set on the display screen. These manually produced class maps are taken as ground truth for comparison with machine-generated classification of the MR images. The manually constructed images are in exact registration with the MR data sets as a result of the manner in which they were constructed. Therefore, the comparison between MR multispectral classification and manually segmented images measures only the error due to signal measurements and not error due to anatomic misregistration or spatial distortion (10). Using the methods outlined above, we designed classifiers to separate the soft tissues present in coronal MRI images of the head and neck obtained on nine patients with a clinical system (Siemens Magnetom). These classifiers were tested for their ability to discriminate gray and white matter, cerebrospinal fluid, edema, fat, and abnormal tissue (e.g., tumor, cyst), using manually traced regions as a standard. Several classifiers were developed using statistical pattern recognition software, including parallelepiped, minimum distance to means, cluster analysis, and maximum likelihood techniques.
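The reject mechanism can be sketched as follows with a Gaussian maximum likelihood rule. The use of a Mahalanobis-distance threshold for the reject class is an assumption for illustration, not the specific criterion implemented in the System/600 software.

```python
import numpy as np

def maximum_likelihood_classify(cube, stats, reject_threshold=None):
    """Gaussian maximum likelihood classification with an optional reject class.

    stats maps class name -> (mean, covariance), e.g. from train_statistics().
    A pixel is placed in the reject class when its smallest squared Mahalanobis
    distance exceeds reject_threshold (an assumed criterion for this sketch).
    """
    names = list(stats)
    h, w, b = cube.shape
    x = cube.reshape(-1, b)
    log_like = np.empty((x.shape[0], len(names)))
    maha = np.empty_like(log_like)
    for j, name in enumerate(names):
        mean, cov = stats[name]
        inv = np.linalg.inv(cov)
        _, logdet = np.linalg.slogdet(cov)
        diff = x - mean
        maha[:, j] = np.einsum("ij,jk,ik->i", diff, inv, diff)
        log_like[:, j] = -0.5 * (maha[:, j] + logdet)
    best = np.argmax(log_like, axis=1)
    labels = np.asarray(names, dtype=object)[best]
    if reject_threshold is not None:
        labels[maha[np.arange(len(best)), best] > reject_threshold] = "REJECT"
    return labels.reshape(h, w)      # class map of the scene, pixel by pixel
```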


Table 1. Confusion matrix for patient GH: automatic classification (minimum distance) versus manual segmentation for the classes Reject, WM, Cyst, CSF, GM, Edema, and PVL-Edema, with pixel counts, percentages, and row and column totals.

Signatures for each scene class were determined for one image using both supervised classification algorithms and unsupervised cluster analysis. The accuracy of classification was measured in multi-class test cases and was best with maximum likelihood. Classifiers were tested after preprocessing protocols were altered, statistical parameters of different tissue classes were edited, or larger regions of interest (ROI) were chosen, to determine the effect of these steps on classification accuracy. The remaining eight patient images were then analyzed with a maximum likelihood classification algorithm. The reject class in Table 1 reflects the presence of pixels which contained no tissue (such as the air outside the patient's body). In MR image processing, the reject class is often the largest, especially when large fields of view are used to include the entire head and neck.

CLASSIFICATION

Five classification algorithms were tested, and within each algorithm there were as many as two possible options to the standard operations. The classification algorithms (Table 1) were cluster analysis, minimum distance, maximum likelihood, virtual maximum likelihood, and parallelepiped, as listed in the following table. The options available on some of these methods were to use a large training region and to manipulate some of the statistics (usually standard deviations) used as a basis of classification:

Classification Algorithm       Training Region Size   Statistical Manipulation (e.g., alter standard deviations)
Cluster Analysis
Minimum Distance               X                      X
Maximum Likelihood             X                      X
Virtual Maximum Likelihood
Parallelepiped                 X

In most cases there were several runs with each technique and option. In general, there seemed to be little change resulting from successive runs with a given combination of classification algorithm and option (training region size and alterations in standard deviations). What improvement there was occurred primarily in the runs with statistical manipulation, where experimentation revealed the most useful changes.

ANALYSIS

The assessment of accuracy for each classifier was performed by comparing the computer-generated class map with a corresponding manually segmented version. The results were tabulated in a confusion matrix (Table 1) using a spreadsheet program (Microsoft Excel on an Apple Macintosh II personal computer). The manually segmented results were entered in columns, and computer-generated ones in rows.
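A sketch of the tabulation (assumed NumPy code; the study used a spreadsheet): two co-registered class maps are cross-tabulated with automatic classes in rows and manual classes in columns.

```python
import numpy as np

def confusion_matrix(automatic_map, manual_map, classes):
    """Cross-tabulate an automatic class map (rows) against the manually
    segmented class map (columns) for co-registered images."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for auto, manual in zip(automatic_map.ravel(), manual_map.ravel()):
        matrix[index[auto], index[manual]] += 1
    return matrix

# Class set used for patient GH (Table 1).
classes = ["Reject", "WM", "Cyst", "CSF", "GM", "Edema", "PVL-Edema"]
```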


Table 2. Measures of classification quality for best run of each technique.

                                        Full Data                          Partial Data
Algorithm               Option*   Accuracy  Kappa hat  Variance     Accuracy  Kappa hat  Variance
Cluster Analysis                  92.63%    0.6322     0.0000102    33.06%    0.2665     0.0000652
Minimum Distance                  90.83%    0.3513     0.0000436    16.69%    0.1164     0.0001393
Minimum Distance        S         91.09%    0.3908     0.0000385    19.06%    0.1327     0.0001198
Minimum Distance        TR        92.46%    0.5439     0.0000248    31.50%    0.2044     0.0000705
Maximum Likelihood                93.47%    0.6830     0.0000095    40.66%    0.2902     0.0000528
Maximum Likelihood      S         93.99%    0.7075     0.0000098    45.41%    0.3085     0.0000568
Maximum Likelihood      TR        94.58%    0.7359     0.0000099    50.78%    0.3569     0.0000549
Virt. Max. Likelihood             91.34%    0.3701     0.0000522    21.34%    0.1583     0.0001513
Parallelepiped                    93.52%    0.6324     0.0000196    41.10%    0.2961     0.0000595
Parallelepiped          TR        94.15%    0.7098     0.0000108    46.85%    0.3412     0.0000729

* S = Statistical manipulation, TR = Large training region.

Ideally, all values should appear in the diagonal. Classification quality was evaluated by two methods. The first evaluation method was accuracy, which is simply the total number of pixels classified correctly divided by the total number of pixels, expressed as a percentage. This gives clear and useful results, but does not consider what level of agreement might be expected by chance alone. The second evaluation method was to calculate the Kappa hat statistic. Kappa hat is currently accepted as the best available statistic for evaluating mapping techniques (11), and compares expected and actual agreement to produce an index where a value of unity indicates perfect agreement and zero indicates chance agreement. The rankings of a given set of mapping techniques will usually be identical for the two measures, but Kappa hat gives a better (though less immediately intuitive) measure of utility and, with calculation of the variance of Kappa hat (12), allows tests of statistical significance (13, 14). Because there was an inherent incompatibility between the images and the classification programs, two versions of the data set were analyzed. The incompatibility resulted from the presence of information for regions outside the head in the MR image. The classification programs, which were developed for mapping the earth's surface, did not allow regions to be simply removed from consideration. Areas could be classified as "REJECT," but were still part of the automatic classification process. For this study, anatomical features outside the braincase were of no interest, so bone, skin, muscle, etc. were classified as "REJECT," and the air surrounding the head, though irrelevant to any study, was of necessity classified as "REJECT" as well. Because air constituted the majority of the image, the majority of the pixels represented a trivial and irrelevant classification task. Therefore, in addition to analyzing the full data set, a partial data set was analyzed.
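The two evaluation measures can be computed from the confusion matrix as sketched below (assumed NumPy code; the delta-method variance of Kappa hat used for the significance tests, refs. 12-14, is not reproduced in this sketch).

```python
import numpy as np

def accuracy_and_kappa(matrix):
    """Overall accuracy and the Kappa hat statistic from a square confusion matrix."""
    matrix = np.asarray(matrix, dtype=float)
    n = matrix.sum()
    p_observed = np.trace(matrix) / n                        # proportion classified correctly
    p_chance = (matrix.sum(axis=0) * matrix.sum(axis=1)).sum() / n ** 2   # chance agreement
    kappa_hat = (p_observed - p_chance) / (1.0 - p_chance)   # 1 = perfect, 0 = chance
    return p_observed, kappa_hat
```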

The partial set had all true "REJECT" pixels removed. This also removed errors of omission, which were very few (at most 2 pixels), while leaving errors of commission. The latter are mostly misclassifications of the "inside the head" REJECT regions and are a meaningful topic for study. Because N is so much greater for the full data set (approximately 10 times), its variance is quite a bit smaller. Since N for the partial set is still large, this is not important; most differences will be statistically significant with either set of results. Both data sets were biased. The full data set tended to exaggerate the capability of the classification systems by including a large number of facile and unimportant tasks. The partial data set tended to underestimate classification performance by removing the correctly identified pixels from REJECT regions inside the head, while leaving the errors. The bias in the partial data set involved fewer pixels than in the full one, so its results are probably more useful.

RESULTS

Calculated values of accuracy, Kappa hat, and variance of Kappa hat are presented in Table 2 for both data sets. The values in the table are for the best classification by each combination of classification algorithm and option. Several patterns are evident in these data. First, the large number of correct but meaningless "outside the head" REJECT classifications in the complete data set greatly reduces the usefulness of accuracy as a basis for comparing classification techniques on the complete data set. This does not occur with accuracy measures in the partial data set, because the REJECT region is not considered. The Kappa hat statistic solves the problem more elegantly by taking into account both the proportion correct and the number of correct classifications expected by chance.


Table 3. Z-scores for differences of kappa hat (partial data).

                     Max. Like.    Max. Like. S   Max. Like. TR   Parallelepiped
Parallelepiped TR    4.55051796    2.87118749     -1.383927       3.9182103
Parallelepiped       0.55825126    -1.1484911     -5.6770145
Max. Like. TR        6.42350227    4.57352515
Max. Like. S         1.748681

The Kappa hat values provide a useful basis for comparison for both data sets, and those for the full data set actually show a greater spread in values than those for the partial data. In spite of these problems and differences, both accuracy and Kappa hat give identical rankings for both full and partial data sets. This indicates a certain robustness and helps lend credence to the patterns evident. The rankings show clear superiorities among the classification techniques. Maximum likelihood is the best classification algorithm, followed closely by parallelepiped and more distantly by cluster analysis. Minimum distance and virtual maximum likelihood do poorly on these tasks. Within the classification algorithms, large training regions provided the best results. Statistical manipulation of the results from small training regions gave a small gain over the unmodified results. Most of the differences between classification techniques are statistically significant. Table 3 lists the Z-scores and Table 4 the p-values for the differences between the best five classifications. These results are based on the values for Kappa hat and its variance from the partial data. The Z-scores would be greater and the p-values lower for the full data set, since the differences between the Kappa hat values are greater and their variances are smaller because of the larger sample size.
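The pairwise comparisons in Tables 3 and 4 follow the standard large-sample test for the difference of two independent Kappa hat estimates, Z = (Kappa hat 1 - Kappa hat 2) / sqrt(variance 1 + variance 2). A sketch using Table 2 partial-data values for maximum likelihood with a large training region versus unmodified maximum likelihood; the result only approximates the corresponding Table 3 entry, since the tabulated inputs are rounded.

```python
from math import erf, sqrt

def kappa_z_score(kappa_1, var_1, kappa_2, var_2):
    """Z-score for the difference of two independent Kappa hat estimates."""
    return (kappa_1 - kappa_2) / sqrt(var_1 + var_2)

def two_sided_p(z):
    """Two-sided p-value for a standard normal deviate."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

# Partial-data values from Table 2: Maximum Likelihood TR vs. Maximum Likelihood.
z = kappa_z_score(0.3569, 0.0000549, 0.2902, 0.0000528)
print(round(z, 2), two_sided_p(z))    # close to the corresponding Table 3 entry
```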

DISCUSSION

MR image data sets are intrinsically multispectral (15). The fundamental processes that form the variations in signal which make up MR images are based on local variations in several parameters, including relaxation times, proton density, and flow (16).


Added to this are noise, arising in the electronics or entering from the environment, and instrumental effects, such as magnetic field and radio frequency inhomogeneities. Humans are relatively poorly equipped to manually process multiparameter image data represented as simultaneous monochrome images shown sequentially or adjacently (17, 18). We employ a multiparameter color display to form color composite images of MRI data sets to facilitate the subjective interpretation of local variations in human tissue. The method is based on the simultaneous mapping of three corresponding MR head slices from the same anatomic level, obtained in the same individual during the same examination, into the three phosphors of a color CRT display. For a three-band MR multiparameter display, we map the first band into red, the second into green, and the third into blue. These are simultaneously viewable on the CRT in composite color. The color patterns obtained with this scheme are interpreted subjectively. To maintain the consistency of color patterns, we typically map a chemical shift image into red, a T1-weighted image into green, and a T2-weighted image into blue. Multispectral analysis of MR head images, especially via statistical classification methods, is based on several assumptions (1-4, 19, 20), which include:

1. That tissue characteristic signatures do exist, and are sufficiently unique that they can be detected by statistical methods and can be utilized to separate components of MR scenes.
2. That MR image data sets can be made geometrically and radiometrically homogeneous, so that by sampling

Table 4. P-values for differences of kappa hat (partial data), for the same pairs of classifiers compared in Table 3.
