Solving Chemical Problems with Pattern Recognition*

B. R. Kowalski**
Department of Chemistry, University of Washington, Seattle, Washington

C. F. Bender***
University of California, Lawrence Livermore Laboratory, Livermore, California

Pattern Recognition is becoming established as a general data analysis tool which has widespread applications in chemistry. Whenever something must be learned from objects (elements, compounds, and mixtures) and a chemical/physical theory has not been sufficiently developed, pattern recognition may provide a solution. Materials production problems, screening applications, source identification and structure analysis are important areas of current interest. It is expected that many more areas of application will open up in the years to come. In short, the "educated guess" is being supported by the computer; at least that is our educated guess.

* Work performed under the auspices of the U.S. Atomic Energy Commission.
** Supported in part by the Office of Naval Research and the National Science Foundation.
*** M. H. Fellow.

Introduction

For centuries scientists have relied heavily on graphical methods for interpreting experimental information. Measurements are often related by simple graphs which can lead to recognizable functions. The function can be fitted to the data and thereafter used to replace the graphical data. For example, if a scientist measures the response from an experiment as a function of time and plots the results as shown in Fig. 1, he then can recognize a pattern in the graph that indicates an exponential relationship between response and time.


Ignoring the experimental errors involved, curve fitting can lead to an equation of the form:

R(t) = A e^{-Bt}    (1)

where R is the response, t the time, and A and B are constants found by some mathematical procedure. Suppose now that two experimental measurements are plotted as in Fig. 2. For example, the measurements might be the concentrations of two trace metals found in the drinking water of various cities. It certainly does not take a trained eye to note two groups or clusters of points in the two-dimensional (2-space) plot. By itself, the plot has very little significance. If, however, all of the cities corresponding to points in the upper left-hand corner of the 2-space plot had an abnormally high incidence of a particular health disorder, while cities represented in the other group were normal with respect to the disorder, the plot would demand a considerable amount of attention.

Fig. 1. Example of a functional relationship between response and time (response R plotted against time)


Fig. 2. Example of two separated groups or clusters (horizontal axis: Measurement #1)


Fig. 3. No clear "clustering", indicating that additional measurements may be required (horizontal axis: Measurement #1)

These two types of graphical examples represent two situations often encountered in science. The first example can be classified as parametric; that is, the data conforms to a standard functional relationship which may be determined and then used to great advantage. Such problems have adequate theoretical bases. The second example can be classified as nonparametric; that is, the data does not appear to be representable in a simple functional form. The important point of similarity between the two examples is that a relationship has been found between independent and dependent variables. In the parametric model, time is related to experimental response; in the nonparametric model, concentrations of trace metals are together related to a health disorder. Unfortunately, problems of the second type are not covered by an adequate theoretical body of knowledge. Instead, the scientist must be content with graphical techniques to study such relationships. Often the "educated guess" is the only tool which is applicable to these difficult problems. Suppose the scientist in the second example theorized that the health disorder was related to the concentration of the two metals in drinking water, but the plot looked like Fig. 3. One possible solution would be to make several other measurements on the samples of water and plot the results using an N-dimensional plot, where N is the number of measurements made on each sample. Since we live in a 3-dimensional world, N-space (N > 3), or, as it is often called, hyperspace, is difficult to visualize and examine. Plotting all combinations of 2-space plots can result in a significant loss of information, and the N(N-1)/2 plots can be numerous (for N = 10, already 45 plots). The scientist has two other alternatives for studying his data. One is to project or map the N-space data structure to the familiar 2- or 3-space by methods which try to minimize the loss of certain attributes of the N-space data structure. The second alternative is to use techniques which examine the data in N-space. In the remainder of this paper, methods and techniques from the newly emerging field of pattern recognition [1] are presented which can be used for both alternatives. The methodology of the branches of pattern recognition is briefly covered and applications to chemical problems are discussed.

Pattern Recognition Philosophy

Pattern recognition was originally developed in the field of artificial intelligence and has been applied in diverse areas such as handwritten and printed alphanumeric character recognition, weather prediction, and medical diagnosis. Recently, pattern recognition was introduced to the chemical literature as a general problem-solving tool [2]. A general statement of the broad class of problems amenable to analysis by pattern recognition is: "Given a collection of objects characterized by measurements made on each object, can an obscure property of the objects be found and/or predicted that is related to the measurements via some unknown relationship?" [2]. In chemistry, the objects range from pure compounds (e.g. a prospective anticancer agent) to complicated mixtures such as water samples taken from a city water supply or complex natural products. Properties include atomic or molecular structure, chemical or biological reactivity, absorptivity, etc., and often must be predicted using experimental measurements which are known (or thought) to be related to the sought-for property. A functional breakdown of the field of pattern recognition is shown in Fig. 4. Parametric methods assume that the probability density functions are known or can be estimated; here Bayes techniques [1] are used. For most chemical applications, the underlying statistical distributions of the data are not known, so the nonparametric branch is used. The branching under nonparametric methods stems from two questions that must be asked at the onset of an application.
1. What must be learned from or about the objects?
2. Are the measurements useful in their original form?

Preprocessing techniques operate on the data (measurements) in order to change the representation of the information contained within them. Learning techniques are applied once the data is in the most useful form and lead (hopefully) to the desired result of being able to recognize an obscure property in a collection of objects from measurements made on the objects. Display techniques provide the first alternative, allowing for 2-dimensional visualization of the N-dimensional data.

Fig. 4. Functional breakdown of pattern recognition (pattern recognition divided into parametric methods and nonparametric methods)

A geometric description of pattern recognition will be used to illustrate the use of the measurements. The objects are considered points in N-dimensional space, where N is equal to the number of measurements made on each object. Obviously the same measurements must be made on each object. The values of the measurements then represent the coordinates of each point in N-space. The nearness in space between two points is a good measure of similarity between the corresponding objects. The Euclidean distance between objects i and j is

d_{ij} = \left[ \sum_{k=1}^{N} (X_{ki} - X_{kj})^2 \right]^{1/2}    (2)

where X_ki is the value of the k-th measurement made on object i. The summation runs over all N measurements. Actually, d_ij is a reciprocal similarity measurement because objects are more alike as d_ij goes to zero. Since the concepts of Euclidean geometry hold true in higher-dimensional space, and since visual examination of points in N-space (N > 3) is impossible, computers are used to analyze the data. The computer can be used to (1) represent the data in the most useful form (preprocessing), (2) map the points from N-space into two or three dimensions for visual examination (display), (3) find clusters or densities of points (unsupervised learning), and (4) construct classification rules and then classify unknown points (supervised learning). Learning procedures operate in N-space in either the supervised or unsupervised mode. In the case of supervised learning, some of the points in N-space have known classifications or properties and constitute the training set. A classification rule is developed for the training set, and the rule is then used to classify unknowns. In unsupervised learning, there are no known classifications, and meaningful interrelationships between objects are developed by finding densities or clusters of points in N-space. Normally, the property determined by supervised learning is a class membership (i.e., active anti-cancer drug vs. inactive); however, methods are being developed to handle continuous properties.
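To make the geometric picture concrete, here is a minimal sketch in Python (not part of the original paper) that computes the Euclidean distance of Eq. (2) and uses it for a simple nearest-neighbor classification, one possible form of a classification rule; the measurement values and class labels are invented for illustration.

```python
import numpy as np

def euclidean_distance(x_i, x_j):
    """Eq. (2): distance between two objects represented as points in N-space."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    return np.sqrt(np.sum((x_i - x_j) ** 2))

# Hypothetical training set: 4 objects, N = 2 measurements each,
# with known class labels (supervised learning).
training_patterns = np.array([[0.1, 0.2],
                              [0.2, 0.1],
                              [0.9, 1.0],
                              [1.0, 0.8]])
training_classes = ["normal", "normal", "disorder", "disorder"]

# Classify an unknown object by the class of its nearest neighbor in N-space.
unknown = np.array([0.85, 0.95])
distances = [euclidean_distance(unknown, p) for p in training_patterns]
nearest = int(np.argmin(distances))
print("predicted class:", training_classes[nearest])
```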

Preprocessing Methods

Preprocessing methods are applied to data for two reasons. First, for certain pattern recognition methods, the ratio of patterns (objects or points) to features (measurements) should be greater than two and preferably on the order of ten; hence preprocessing can be used to reduce the number of features. Second, preprocessing methods can be used to enhance or weight important information in the original features. In geometrical terms, the dimensionality of the data can be reduced and/or the data structure can be rotated or distorted. Scaling is an important preprocessing step. If measurements have widely different ranges of possible values, some form of scaling must be applied to insure that features are given equal weight. Autoscaling [2] creates new features from the old,

X'_{ki} = \frac{X_{ki} - \bar{X}_k}{\sigma_k}    (3)

where X'_ki is the new k-th feature for the i-th pattern, X̄_k is the mean of feature k calculated over the entire data set of M patterns, and σ_k is the feature standard deviation,

\bar{X}_k = \frac{1}{M} \sum_{i=1}^{M} X_{ki}    (4)

\sigma_k^2 = \frac{1}{M} \sum_{i=1}^{M} (X_{ki} - \bar{X}_k)^2    (5)

Note that the new features have zero mean and unit variance.
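As a rough illustration of the autoscaling step, the sketch below implements Eqs. (3)-(5) for data stored as a NumPy array with one row per pattern and one column per feature; the numerical values are hypothetical.

```python
import numpy as np

def autoscale(X):
    """Scale each feature (column) to zero mean and unit variance, Eqs. (3)-(5)."""
    X = np.asarray(X, float)
    means = X.mean(axis=0)      # Eq. (4): feature means over all M patterns
    stds = X.std(axis=0)        # Eq. (5): feature standard deviations
    return (X - means) / stds   # Eq. (3): autoscaled features

# Hypothetical data: M = 4 patterns, N = 3 measurements.
X = np.array([[1.0, 200.0, 0.02],
              [2.0, 180.0, 0.05],
              [1.5, 220.0, 0.01],
              [2.5, 210.0, 0.04]])
X_scaled = autoscale(X)
print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # approximately 1 for every feature
```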

If there is reason to believe that a particular measurement is more important than the others, that feature can be weighted. Thus, weighting forms the second important preprocessing method. New features are calculated from the scaled features as

X''_{ki} = w_k X'_{ki}    (6)

where w_k is a weight which can be assigned by the scientist (rarely) or found by one of several automatic weighting calculations (e.g. variance weighting [2] or Fisher weighting [3]). New features can also be calculated which are either linear or non-linear combinations of the original features. The literature abounds with such procedures [4, 5]. Before proceeding to the next section, it is important to point out the two general types of measurements encountered in chemical applications. Spectral data is obtained from a digitized spectrum (mass, infrared, etc.) measured on the objects under study. For this type of data, transform methods (Fourier, Hadamard, etc.) can be applied as preprocessing methods; and indeed, such methods can produce significantly better classification results [6]. For non-spectral data (trace-element concentrations, for example), such transformation methods have not shown similar improvements.
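The weighting of Eq. (6) can be sketched in the same style. The Fisher-type weight used below, the squared difference of class means divided by the sum of within-class variances, is one common form and is only an assumption here; the exact definitions of variance and Fisher weighting in Refs. [2, 3] may differ.

```python
import numpy as np

def fisher_type_weights(X, labels):
    """One common form of a Fisher-type weight for a two-class problem:
    w_k = (difference of class means)^2 / (sum of within-class variances)."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    a, b = X[labels == classes[0]], X[labels == classes[1]]
    return (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))

def weight_features(X_scaled, weights):
    """Eq. (6): multiply each (scaled) feature by its weight."""
    return np.asarray(X_scaled, float) * np.asarray(weights, float)

# Hypothetical autoscaled data for two classes of objects.
X_scaled = np.array([[-1.0, 0.3], [-0.8, -0.2], [0.9, 0.1], [0.9, -0.2]])
labels = np.array(["active", "active", "inactive", "inactive"])
w = fisher_type_weights(X_scaled, labels)
print(weight_features(X_scaled, w))
```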

Display

One of the alternatives to studying N-dimensional data structure is to project or map the data structure to a lower-dimensional space which is more amenable to human examination [7]. Although it is possible to analyze 3-space using interactive computer graphics, only 2-space methods will be discussed here. Linear projection methods amount to calculating 2-space coordinates for each pattern as a linear combination of the N-space features. The projection is the same as preprocessing to reduce to two features and then plotting the new features for each pattern. Eliminating all but two of the original features and plotting the remaining two is an example of a trivial linear projection. A more sophisticated procedure is the eigenvector projection method [7]. In this method, the feature-by-feature covariance matrix C is diagonalized,

C \mathbf{y} = \lambda \mathbf{y}    (7)

The matrix elements of C are calculated as

C_{kl} = \sum_{i=1}^{M} (X_{ki} - \bar{X}_k)(X_{li} - \bar{X}_l)    (8)

where M is the number of patterns. Y_1 and Y_2, the eigenvectors corresponding to the two largest eigenvalues λ_1 and λ_2, are used to calculate the coordinates (a_i, b_i) of each pattern in 2-space,

a_i = \sum_{k=1}^{N} Y_{1k} X_{ki}, \qquad b_i = \sum_{k=1}^{N} Y_{2k} X_{ki}    (9)
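A minimal sketch of the eigenvector projection of Eqs. (7)-(9), assuming the features have already been autoscaled and are stored as a NumPy array with one row per pattern; the data here are random numbers used only to show the mechanics.

```python
import numpy as np

def eigenvector_projection(X):
    """Map N-space patterns to 2-space using the two eigenvectors of the
    feature-by-feature covariance matrix C (Eqs. 7-9)."""
    X = np.asarray(X, float)
    deviations = X - X.mean(axis=0)
    C = deviations.T @ deviations          # Eq. (8): C_kl = sum_i (X_ki - mean_k)(X_li - mean_l)
    eigvals, eigvecs = np.linalg.eigh(C)   # Eq. (7): C y = lambda y (C is symmetric)
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues, largest first
    Y1, Y2 = eigvecs[:, order[0]], eigvecs[:, order[1]]
    a = X @ Y1                             # Eq. (9): a_i = sum_k Y_1k X_ki
    b = X @ Y2                             #          b_i = sum_k Y_2k X_ki
    return np.column_stack([a, b])

# Hypothetical data: M = 5 patterns, N = 4 measurements (already scaled).
X = np.random.default_rng(0).normal(size=(5, 4))
print(eigenvector_projection(X))           # one (a_i, b_i) pair per pattern
```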

A nonlinear mapping method generates a 2-space plot for which the two coordinates are not linear combinations of the original N features. An example is the NLM method proposed by Sammon [8], which was later modified for application to chemical data [7]. In this method an error function

E = \sum_{i<j}^{M} \frac{(d^*_{ij} - d_{ij})^2}{d^*_{ij}}    (10)

where d^*_ij is the distance between patterns i and j in N-space and d_ij is the corresponding distance in the 2-space plot, is minimized by adjusting the 2-space coordinates.

Supervised Learning
