APPLIED MICROBIOLOGY, Aug. 1975, p. 282-289 Copyright © 1975 American Society for Microbiology

Vol. 30, No. 2 Printed in U.S.A.

Principal Component Analysis of Infraspecific Variation in Bacteria

GARY DARLAND¹

Enterobacteriology Branch, Center for Disease Control, Atlanta, Georgia 30333

Received for publication 28 January 1975

In certain types of ecological investigations it may be desirable to investigate infraspecific variation in bacteria. Principal component analysis is demonstrated to be satisfactory for this purpose. Hypothetical bacterial populations were used to show that such analysis can be used to compare collections of bacterial isolates taken at different times or from different sources. Alternatively, given n isolates, whether they represent a single bacterial population can be determined. The method is applied to authentic collections of bacteria in three separate analyses. The results are compatible with current taxonomic tenets.

¹ Present address: Merck Institute for Therapeutic Research, Basic Microbiology Department, Rahway, N.J. 07065.

Although bacteriologists have long accepted the fact that variation is the rule in bacterial species and have for the most part abandoned a typological approach to taxonomy, they have paid little attention to a systematic analysis of infraspecific variation. Such analyses are necessary to answer fundamental questions in bacterial taxonomy and are of potential interest in certain types of ecological or epidemiological studies. One such question is whether a collection of bacterial isolates taken during one interval of time is the same as a similar sample taken at some other time. In such studies, the basic question can be reduced to the following: given a sample of n bacterial isolates from diverse sources or habitats and a set of m tests, what is the probability that all n isolates represent the same bacterial population? The purpose of this paper is to illustrate how principal component analysis can be used to investigate such questions.

A discussion of the mathematical details of principal component analysis can be found in several textbooks (1, 14). Briefly, the technique consists of a series of linear transformations of the original m-dimensional observation vector into a new vector of principal components. Three consequences of this particular type of transformation are of importance in taxonomic studies. First, although a maximum of m principal axes exist, it is generally possible to explain a major portion of the sample variance with k < m axes. Second, the principal axes are mutually perpendicular and hence the principal components are uncorrelated. This greatly reduces the number of parameters necessary to explain the structure of the population in question. Third, although the total variance of the sample is unchanged by the transformation, it is partitioned in such a way that variance between groups is maximized and variance within groups is minimized (14).

Principal component analysis has been used in taxonomic studies of Staphylococcus (11), Bacillus (13), soil microorganisms (17), and in a study of 167 isolates of Streptococcus faecalis (4). The relationship between principal component analysis and the more conventional form of numerical taxonomy, which involves calculation of various types of matching coefficients (16), has been discussed by Gower (8). Although principal component analysis and the calculation of matching coefficients give similar results, Gower concluded, on the basis of the computational time involved, that matching coefficients may be of more value for routine taxonomic studies. However, because of the uncertainty concerning the distribution of matching coefficients, they do not appear to be of use in describing variation within a well-defined taxon. As will be demonstrated, principal component analysis may be the method of choice for investigations of this type (see also reference 16).

Gyllenberg (10) has suggested the use of principal components as a means of computer identification of bacteria. In this particular method, a collection of well-characterized strains is used, and the group centroids are determined. After calculating the principal component scores of an unknown organism, the unknown is assigned to that taxon to which it is the closest in terms of distance. Gyllenberg suggested that a minimum of five principal axes may be needed for such a method.

MATERIALS AND METHODS

Analytical procedure. Data analysis was performed by using a slightly modified version of FACTO (IBM, System/360 Scientific Subroutine Package, Version III). In this algorithm each individual isolate, X, is treated as a vector, where a vector element, $x_i$, represents the isolate's score on test $i$. The first computational step is to determine the correlation matrix, $R$, whose elements, $r_{ij}$, represent the product moment correlation coefficient between variables $i$ and $j$, that is,

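$$
r_{ij} = \frac{\sum_{t=1}^{n} (x_{it} - \bar{x}_i)(x_{jt} - \bar{x}_j)}{\sqrt{\sum_{t=1}^{n} (x_{it} - \bar{x}_i)^2 \, \sum_{t=1}^{n} (x_{jt} - \bar{x}_j)^2}}, \qquad i, j = 1, 2, \ldots, m
$$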
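As an illustration of this first step, the following is a minimal sketch in Python with NumPy, not the FACTO routine itself. The function name `correlation_matrix` and the small 0/1 score table are hypothetical and serve only to show the arithmetic of the product moment correlation for an n x m table of test scores.

```python
import numpy as np

def correlation_matrix(scores):
    """Product-moment correlation matrix R (m x m) for an n x m score table."""
    x = np.asarray(scores, dtype=float)
    centered = x - x.mean(axis=0)        # x_it minus the mean of test i
    cross = centered.T @ centered        # sums of cross-products (numerators)
    norms = np.sqrt(np.diag(cross))      # square roots of the sums of squares
    return cross / np.outer(norms, norms)

# Hypothetical data: 6 isolates scored on 3 tests (1 = positive, 0 = negative).
scores = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])
R = correlation_matrix(scores)
print(np.round(R, 3))
```

The result can be cross-checked against `np.corrcoef(scores, rowvar=False)`. A test that gives the same result for every isolate has zero variance and would produce a division by zero; the sketch assumes such tests have been removed beforehand.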

In the equation for $r_{ij}$, $\bar{x}_i$ is the mean value of test $i$ over the $n$ isolates. $R$ is a symmetric matrix, since $r_{ij} = r_{ji}$, with $r_{ii} = 1.0$. The correlation matrix is then used to determine the principal components of organism X. First, the characteristic roots ($\lambda_i$) of the determinantal equation

$$
|R - \lambda I| = 0
$$

are determined, where $I$ is the identity matrix. The $\lambda_i$ so determined are often called eigenvalues. The eigenvalues are ordered as follows: $\lambda_1 > \lambda_2 > \cdots > \lambda_m$. A maximum of $m$ eigenvalues exist. It can be shown that

$$
\sum_{i=1}^{m} \lambda_i = m
$$

and further, that $\sum \lambda_i / m$, $j$ ...
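The eigenvalue step can be sketched numerically as follows. This is an illustrative example under stated assumptions: the 200 x 6 table of simulated 0/1 scores, the seed, and the variable names are hypothetical. `numpy.linalg.eigh` is used because $R$ is symmetric. The sum of the eigenvalues equals $m$ because the trace of a correlation matrix is $m$. The cumulative ratio is read here, as is common, as the fraction of total variance accounted for by the leading axes; that reading is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)                        # arbitrary seed (assumption)
scores = (rng.random((200, 6)) < 0.7).astype(float)   # hypothetical 200 isolates x 6 tests
R = np.corrcoef(scores, rowvar=False)                 # correlation matrix of the tests

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order,
# so both outputs are reordered to give lambda_1 >= lambda_2 >= ... >= lambda_m.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

m = R.shape[0]
cumulative_fraction = np.cumsum(eigenvalues) / m      # share of variance in the first j axes

print("sum of eigenvalues:", round(float(eigenvalues.sum()), 6), "m =", m)
print("cumulative fraction:", np.round(cumulative_fraction, 3))
```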
RESULTS

Sample drawn from a single population. The usual method of describing a sample of independent bacterial isolates is by recording the percentage of isolates that respond in a certain way on a given test. When this type of definition is extended to cover all the tests considered, it represents the best definition of the population from which the sample was taken. The fact that not all of the isolates are identical is implicit in this type of definition. Less obvious is the assumption that the variation seen in the sample is due to independent and random events (mutations). With this concept in mind, it is possible to define hypothetical populations and draw samples at random from these populations.

... 0.9 test 1 is negative, and if P2 < 0.9 test 1 is positive; similarly for test 2, if P2 > 0.5 test 2 is negative, and if P2 < 0.5 test 2 is positive. This procedure can be extended to any number of such tests, and any number of isolates can thus be selected at random from our hypothetical population. A sample of n such randomly selected individuals can then be used to calculate the principal component scores.

The data summarized in Fig. 1 represent the distribution of first principal component scores when a sample of 500 individuals is drawn at random from a hypothetical bacterial population. In this case, the population was arbitrarily defined as follows:

$$
\begin{aligned}
p_i &= 0.7, \quad i = 1, 2, \ldots, 16 \\
p_i &= 0.9, \quad i = 17, 18, \ldots, 48
\end{aligned}
$$

where $p_i$ represents the probability of observing a positive result with test $i$. The observed distribution of principal component scores (open circles) is compared in Fig. 1 to the theoretical Gaussian distribution based on the sample estimates of the mean and variance. The statistics for the normal curve were derived from the data. Calculation of the Kolmogorov-Smirnov statistic indicated that the observed distribution of principal components did not differ significantly from a normal distribution.

FIG. 1. Cumulative frequency distribution of first principal component scores. A sample of 500 individuals was chosen at random from a single hypothetical population. The solid line is the theoretical normal distribution. ○, Cumulative number of individuals. [Abscissa: first principal component.]

Two factors that could conceivably influence the outcome of a principal component analysis are sample size (n) and number of variables (m). Figures 2 and 3 indicate the influence of these two variables on the distribution of principal components. Figure 2 represents a cumulative frequency distribution of first principal component scores based on 500 isolates drawn from a single hypothetical population. The population parameters were:

$$
\begin{aligned}
p_i &= 0.9, \quad i = 1, 2, \ldots, m/3 \\
p_i &= 0.5, \quad i = m/3 + 1, m/3 + 2, \ldots, 2m/3 \\
p_i &= 0.1, \quad i = 2m/3 + 1, 2m/3 + 2, \ldots, m
\end{aligned}
$$

where m was allowed to vary, depending on the experiment. The sigmoidal nature of the curve is indicative of a normal distribution. Furthermore, within the indicated limits the number of variables had very little effect on the overall distribution. In all three cases the distribution did not differ significantly from normal.

The influence of sample size was determined by drawing random samples of varying size from a hypothetical population and determining the results on a set of m = 48 tests. The parameters were:

$$
\begin{aligned}
p_i &= 0.9, \quad i = 1, 2, \ldots, 16 \\
p_i &= 0.5, \quad i = 17, 18, \ldots, 32 \\
p_i &= 0.1, \quad i = 33, 34, \ldots, 48
\end{aligned}
$$
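The sampling procedure and the normality check described above can be sketched as follows. This is an illustration under stated assumptions rather than the original program: the helper names `draw_isolates` and `first_pc_scores`, the seed, and the use of NumPy and SciPy are all hypothetical choices. The population uses the m = 48 parameters of the sample-size experiment, with each test positive independently with probability $p_i$. Because the mean and variance are estimated from the same sample, the p-value reported by the Kolmogorov-Smirnov test is only approximate.

```python
import numpy as np
from scipy import stats

def draw_isolates(p, n, rng):
    """Draw n isolates; test i is positive (1) with probability p[i], independently."""
    return (rng.random((n, len(p))) < p).astype(float)

def first_pc_scores(scores):
    """Scores of each isolate on the first principal axis of the correlation matrix."""
    keep = scores.std(axis=0) > 0                # drop tests with no variation in the sample
    z = scores[:, keep]
    z = (z - z.mean(axis=0)) / z.std(axis=0)     # standardize each remaining test
    _, vectors = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    return z @ vectors[:, -1]                    # eigh sorts ascending: last column = largest root

p = np.repeat([0.9, 0.5, 0.1], 16)               # p_i for tests 1-16, 17-32, 33-48
rng = np.random.default_rng(1975)                # arbitrary seed (assumption)
pc1 = first_pc_scores(draw_isolates(p, 500, rng))

# Compare the standardized scores with a standard normal distribution.
standardized = (pc1 - pc1.mean()) / pc1.std(ddof=1)
statistic, p_value = stats.kstest(standardized, "norm")
print(f"KS statistic = {statistic:.3f}, approximate p = {p_value:.3f}")
```

The sign of the first principal axis is arbitrary, so the scores may be mirrored from one run or library to another without affecting the comparison with a normal distribution.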

The data were plotted on probability paper and were summarized in Fig. 3. For samples of n ≥ 100, the data could be easily fitted to a single straight line, indicating a normal distribution. For a sample of n = 50, the data ...

[Figure legend: m = 21, m = 39, m = 48.]
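Plotting on probability paper can be approximated numerically with a normal probability plot, in which a straight line corresponds to a correlation coefficient close to 1 between the ordered scores and the normal quantiles. The following sketch makes several assumptions: the sample sizes, the seed, and the helper `first_pc_scores` are illustrative, and `scipy.stats.probplot` stands in for the graphical method. It does not reproduce the published figures.

```python
import numpy as np
from scipy import stats

def first_pc_scores(scores):
    """Scores of each isolate on the first principal axis of the correlation matrix."""
    keep = scores.std(axis=0) > 0                # drop tests that are constant in this sample
    z = scores[:, keep]
    z = (z - z.mean(axis=0)) / z.std(axis=0)
    _, vectors = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    return z @ vectors[:, -1]                    # axis belonging to the largest eigenvalue

p = np.repeat([0.9, 0.5, 0.1], 16)               # m = 48 tests, as in the text
rng = np.random.default_rng(3)                   # arbitrary seed (assumption)

fit_r = {}
for n in (50, 100, 200, 500):                    # illustrative sample sizes (assumption)
    sample = (rng.random((n, p.size)) < p).astype(float)
    # probplot returns the plot coordinates and a least-squares line (slope, intercept, r).
    _, (slope, intercept, r) = stats.probplot(first_pc_scores(sample), dist="norm")
    fit_r[n] = r
    print(f"n = {n:3d}: probability-plot correlation r = {r:.4f}")
```

Values of r near 1 indicate that the points lie close to a single straight line; how far r may fall before the fit is judged unsatisfactory is a matter of judgment and is not fixed by this sketch.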
