Comput. Biol. Med. Vol. 5, pp. 21-28. Pergamon Press 1975. Printed in Great Britain

NORMALIZATION OF CHROMOSOME MEASUREMENTS: A NEW METHOD

DAN H. MOORE II
Biomedical Division, University of California, Lawrence Livermore Laboratory, Livermore, California 94550, U.S.A.

(Received 7 January 1974 and in revised form 6 August 1974)
Abstract-A new method is proposed for normalizing chromosome measurements based on the maximum likelihood principle. The method can be applied to any number of chromosomes and does not require measurements on all the chromosomes of a cell. Thus it is ideally suited for normalizing measurements from cells with partial or aberrant karyotypes. Several measures of performance for normalizers are proposed and used to evaluate the new method. Data from human metaphase chromosomes are used to illustrate the new method.
Normalization    Chromosomes    Automatic karyotyping
INTRODUCTION

Chromosome measurements such as length, area and DNA stain content vary widely from cell to cell due to differential staining and contraction of the chromosomes. Therefore, if one wishes to compare measurements on chromosomes from different individuals, it is necessary first to remove this (within-individual) cell-to-cell variation. The process of removing this variation is called “normalization”. The usual method is to divide each chromosome measurement by the total of all the measurements for the cell from which it arose. However, this method has serious limitations that have been discussed in several papers (e.g. Ledley et al. [1] and Hilditch and Rutovitz [2]); basically, the problem is that the method can be applied only to cells that contain the normal complement of 46 chromosomes. Thus, it cannot be used on measurements from abnormal cells that contain other than 46 chromosomes. Nor can it be applied to measurements from normal cells for which it was not possible to obtain usable measurements on all 46 chromosomes.

Various other methods for normalization have been proposed and reviewed by Hilditch and Rutovitz [2]. They also proposed a normalization method of their own, which they show to be superior to any of the traditional methods.

This paper reviews the problem of normalization from a mathematical-statistical viewpoint and presents a new method for normalization. This method is based on the maximum likelihood principle and minimizes the cell-to-cell variation in the measurements. In addition, several criteria are proposed for evaluating the effectiveness of normalization procedures. The method and its performance are illustrated on some human chromosome data.

NORMALIZATION METHOD
To simplify explanation of the normalization method, assume that measurements have been made on $n$ metaphase cells, each containing $m$ chromosomes. Let $X_{ij}$ denote the measurement on the $i$th chromosome ($i = 1, 2, \ldots, m$) from the $j$th metaphase cell ($j = 1, 2, \ldots, n$). Let $C_j$ represent the multiplicative normalizer which is applied to all $m$ chromosome measurements of the $j$th cell. Also, let $\mu_i$ and $\sigma_i^2$ be the true (population) mean and variance of the normalized measurements ($C_j X_{ij}$) on the $i$th chromosome. The goal of normalization is to find $C_j$ so that the differences between the normalized measurements and the true means are as small as possible. Algebraically, these differences can be written

$$d_{ij} = C_j X_{ij} - \mu_i.$$
In general, the size of the difference $d_{ij}$ increases with the size of the mean $\mu_i$, so that it is appropriate to determine $C_j$ so that the weighted sum

$$S = \sum_{i=1}^{m} (C_j X_{ij} - \mu_i)^2 / \sigma_i^2 \tag{1}$$
is minimized. It can be shown (see Appendix) that minimizing $S$ also maximizes the log of the likelihood function when the measurements are assumed to be normally and independently distributed random variables. Thus we call this method maximum likelihood (ML) normalization. Setting the derivative of $S$ (with respect to $C_j$) equal to zero yields the solution

$$C_j = \frac{\sum_i X_{ij}\,\mu_i / \sigma_i^2}{\sum_i (X_{ij}/\sigma_i)^2}. \tag{2}$$
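As a minimal sketch of equation (2), the Python function below computes the normalizer for a single cell when the means and variances are taken as known. The names `ml_normalizer` and `weighted_sum`, the list-based argument layout and the example numbers are assumptions made for illustration; they do not come from the original paper.

```python
def ml_normalizer(x, mu, var):
    """Normalizer C_j of equation (2) for one cell.

    x   -- measurements X_ij for the cell, one per chromosome i
    mu  -- assumed-known means mu_i
    var -- assumed-known variances sigma_i**2 (must be non-zero)
    """
    if not (len(x) == len(mu) == len(var)):
        raise ValueError("x, mu and var must have the same length")
    numerator = sum(xi * mi / vi for xi, mi, vi in zip(x, mu, var))
    denominator = sum(xi * xi / vi for xi, vi in zip(x, var))
    return numerator / denominator


def weighted_sum(c, x, mu, var):
    """Weighted sum S of equation (1) for a trial normalizer c."""
    return sum((c * xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))


# Hypothetical cell whose measurements are all exactly half the means.
mu = [4.0, 3.0, 1.0]
var = [0.04, 0.03, 0.01]
x = [2.0, 1.5, 0.5]
c = ml_normalizer(x, mu, var)  # every value needs doubling, so c is 2 up to rounding
```

Evaluating `weighted_sum` at values slightly above and below `c` is a quick way to check that the returned normalizer does sit at the minimum of equation (1).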
In practical applications the true means $\mu_i$ and the variances $\sigma_i^2$ are unknown and must be estimated from data. This leads to the following iterative procedure for normalizing measurements (see flowchart in Fig. 1):

1. Find the sample means

$$\bar{X}_i = \sum_j X_{ij} / n$$

and the sample variances

$$S_i^2 = \sum_j (X_{ij} - \bar{X}_i)^2 / (n - 1),$$

for each chromosome ($i = 1, 2, \ldots, m$).

2. Using $\bar{X}_i$ and $S_i^2$ as estimates of $\mu_i$ and $\sigma_i^2$, find $C_j$ according to equation (2).

3. Form corrected values

$$X_{ij}^{(1)} = C_j^{(1)} X_{ij},$$

where the superscript 1 refers to the results of the first iteration. Now calculate corrected means

$$\bar{X}_i^{(1)} = \sum_j X_{ij}^{(1)} / n$$

and corrected variances

$$(S_i^2)^{(1)} = \sum_j (X_{ij}^{(1)} - \bar{X}_i^{(1)})^2 / (n - 1).$$
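The three steps above can be sketched in Python as follows. This is an illustrative sketch, not the author's program: the function names, the stopping tolerance of 0.01, the iteration cap and the example data are all assumptions, and the sketch handles only complete cells (every cell lists the same $m$ chromosomes in the same order).

```python
def sample_stats(cells):
    """Per-chromosome sample means and variances over n cells (step 1)."""
    n = len(cells)
    m = len(cells[0])
    means = [sum(cell[i] for cell in cells) / n for i in range(m)]
    variances = [
        sum((cell[i] - means[i]) ** 2 for cell in cells) / (n - 1)
        for i in range(m)
    ]
    return means, variances


def ml_normalize(cells, tol=0.01, max_iter=20):
    """Repeat steps 2 and 3 until every per-pass normalizer is within tol of 1.

    cells -- list of n cells, each a list of m measurements
    Returns the corrected values and the overall normalizer of each cell.
    """
    current = [list(cell) for cell in cells]
    for _ in range(max_iter):
        means, variances = sample_stats(current)
        if min(variances) <= 0.0:
            raise ValueError("a chromosome has zero sample variance")
        # Step 2: equation (2) with sample estimates in place of mu_i, sigma_i^2.
        factors = [
            sum(x * mu / v for x, mu, v in zip(cell, means, variances))
            / sum(x * x / v for x, v in zip(cell, variances))
            for cell in current
        ]
        # Step 3: corrected values for this pass.
        current = [[c * x for x in cell] for c, cell in zip(factors, current)]
        if all(abs(c - 1.0) < tol for c in factors):
            break
    # Overall normalizer: final corrected value divided by the original value.
    overall = [cur[0] / orig[0] for cur, orig in zip(current, cells)]
    return current, overall


# Hypothetical data: four cells, four chromosomes, differing overall scale.
cells = [
    [4.04, 2.97, 2.04, 0.98],
    [5.88, 4.59, 2.97, 1.515],
    [2.856, 2.10, 1.372, 0.707],
    [4.752, 3.636, 2.424, 1.20],
]
corrected, overall = ml_normalize(cells)
```

In this made-up data set the second cell is the largest overall and the third the smallest, so the second cell receives the smallest normalizer and the third the largest.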
Steps 2 and 3 are repeated until the corrected values $X_{ij}^{(k)}$ stabilize (i.e. $X_{ij}^{(k)}$ differ little from $X_{ij}^{(k-1)}$ or, equivalently, $C_j^{(k)}$ converge to 1.0). In practice this usually occurs after two or three iterations. Once the corrected values have stabilized, the value for the normalizer
$C_j$ may be found by dividing the final corrected value $X_{ij}^{(k)}$ for any chromosome $i$ by the original, unnormalized value $X_{ij}$. Finally, all measurements are rescaled so that the sum of the autosome means is 100 units.

This procedure for normalization is similar to one proposed by Hilditch and Rutovitz [2], although theirs was derived from a different criterion. Their method was derived by minimizing the variance of the normalizer $C_j$ rather than by minimizing the variance of the normalized measurements. Also, their derivation requires an assumption that the measurements be independent. This restriction is not required in the present derivation, although independence among the measurements does ensure that the normalizer given in equation (2) maximizes the likelihood (see Appendix). Finally, since Hilditch and Rutovitz do not discuss estimation of $\mu_i$ and $\sigma_i^2$, it is not clear how to use their procedure when these parameters are unknown.

NORMALIZATION IN ABNORMAL OR INCOMPLETE CELLS
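When cells contribute different sets of measurements, the sums in equation (2) can simply run over whichever identified chromosomes were measured in that cell. The sketch below illustrates this under stated assumptions: measurements are held in dictionaries keyed by chromosome label, the means and variances are supplied for the identified chromosomes only, and the function name and example values are hypothetical.

```python
def ml_normalizer_incomplete(cell, mu, var):
    """Equation (2) restricted to the chromosomes usable in this cell.

    cell -- dict: chromosome label -> measurement (may include labels
            with no entry in mu/var, e.g. unidentified chromosomes)
    mu   -- dict: label -> estimated mean for identified chromosomes
    var  -- dict: label -> estimated variance for identified chromosomes
    """
    used = [label for label in cell if label in mu and label in var]
    if not used:
        raise ValueError("cell has no identified chromosome to normalize on")
    numerator = sum(cell[k] * mu[k] / var[k] for k in used)
    denominator = sum(cell[k] ** 2 / var[k] for k in used)
    return numerator / denominator


# Hypothetical estimates for three identified chromosomes.
mu = {"1": 4.0, "2": 3.0, "3": 2.0}
var = {"1": 0.04, "2": 0.03, "3": 0.02}

# Chromosome 2 is missing and one chromosome could not be identified.
cell = {"1": 2.0, "3": 1.0, "unidentified": 0.7}
c = ml_normalizer_incomplete(cell, mu, var)

# The unidentified chromosome is excluded from the fit but still normalized.
normalized = {label: c * value for label, value in cell.items()}
```

Keying by label rather than by position means a missing chromosome needs no placeholder, and an abnormal chromosome can be left out of the fit just by omitting it from `mu` and `var`.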
In the preceding section it was assumed that $m$ chromosomes were measured in each cell. For normal cells, $m = 46$, but in practice only the 44 autosomes are measured for the determination of $C_j$, so as to avoid problems with the differences between male and female sex chromosomes. However, measurements on any chromosome can be eliminated from the calculations to determine the normalization constants $C_j$. For example, if it is known that some (or all) cells of a certain individual contain an abnormal chromosome, measurements on this chromosome should not be used in determining $C_j$. In addition, measurements are not required on an equal number of chromosomes from each metaphase cell. The index $m$ can be replaced with $m_j$ to denote the number of measurements in the $j$th cell. For example, if a measurement on one chromosome in cell $j = 1$ is missing, then $m_1 = 43$, not 44, and normalization is still possible. Thus we see that the minimum requirements for this procedure are:

1. at least one chromosome must be identified in each cell;
2. there must be at least two measurements, among all $n$ cells, of each identified chromosome.

The first requirement is no serious limitation since in most preparations the number 1 chromosomes are easily identified by their size and shape. The second requirement is easily met by sampling cells until the minimum number of measurements is obtained on identified chromosomes. Obviously, chromosomes that cannot be identified positively can still be normalized once the constants $C_j$ are determined from measurements on the remaining (identified) chromosomes.

CORRECTIONS TO NORMALIZER
It may be necessary to apply corrections to the normalizer $C_j$ when there is evidence that the variable being normalized is affected by differential chromosomal contraction. For example, Neurath et al. [3] and Ledley et al. [1] report a significant regression of length on total length of all chromosomes in a cell. Regression effects may be removed by the addition or subtraction of another constant, $R_i$, whose size is determined by the regression of chromosome $i$ length on total cell length. An explanation of the method for determining the $R_i$'s can be found in either report. Fortunately, regression effects are insignificant for optical density measurements (e.g. less than 0.2 per cent of the variation in o.d. measurements on number 1 chromosomes is accounted for by regression on total cellular o.d.).
EVALUATION METHODS

A simple measure of the effectiveness of normalization is the within-group variance

$$S_i^2 = \sum_j (Y_{ij} - \bar{Y}_i)^2 / (n_i - 1),$$

where $Y_{ij}$ is the normalized measurement on the $i$th chromosome from the $j$th cell, $\bar{Y}_i$ is the mean of the normalized measurements on the $i$th chromosome and $n_i$ is the number of $i$ chromosomes. With the constraint that $\sum_i \bar{Y}_i = 100$ for the 44 autosome means, smaller values of $S_i^2$ indicate better normalization. An overall measure is provided by the weighted mean standard deviation

$$\bar{S} = \sum_i n_i S_i \Big/ \sum_i n_i.$$

A second measure of normalization efficiency is the total log likelihood (see Appendix)

$$TLL = -\tfrac{1}{2} \sum_i \sum_j \left[ (Y_{ij} - \bar{Y}_i)^2 / S_i^2 + \log S_i^2 \right].$$
This measure will, in general, increase with the total sample size $N = \sum_i n_i$. Thus a more useful means for comparing normalization in different experiments is provided by the average log likelihood per measurement, $\overline{LL} = TLL/N$.

Since one of the main reasons for normalizing is to increase the ability to distinguish between chromosomes of different karyotype number, a third measure of effectiveness is the ratio

$$F = \frac{\sum_i \sum_j (\bar{Y}_i - \bar{Y})^2 / 21}{\sum_i \sum_j (Y_{ij} - \bar{Y}_i)^2 \big/ \sum_i (n_i - 1)},$$

where

$$\bar{Y} = \sum_i n_i \bar{Y}_i \Big/ \sum_i n_i.$$

Those familiar with the analysis of variance technique will recognize $F$ as the ratio used to test whether the 22 autosome means are from the same population. The test itself is not very informative since $F$ is significant even for the unnormalized measurements, due to the large differences between the large and small chromosomes. Again, the value for $F$ will increase with total sample size $N$, so that the suggested evaluation measure is the average $F$ for separability, $\bar{F} = F/N$.
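The three evaluation measures of this section can be sketched together in Python. Several points here are assumptions rather than statements from the paper: the function name and dictionary layout are hypothetical, the log likelihood is taken as expression (A3) of the Appendix with sample estimates substituted and the constant term dropped, and the between-group degrees of freedom are written generally as the number of groups minus one (21 for the 22 autosomes).

```python
import math


def evaluation_measures(groups):
    """groups maps a chromosome label to its list of normalized measurements.

    Each group needs at least two measurements and a non-zero variance.
    """
    total_n = sum(len(values) for values in groups.values())
    grand_mean = sum(sum(values) for values in groups.values()) / total_n

    weighted_sd = 0.0  # running sum of n_i * S_i
    tll = 0.0          # total log likelihood, constant term omitted
    between = 0.0      # sum over groups of n_i * (mean_i - grand mean)^2
    within = 0.0       # sum of squared deviations from the group means
    for values in groups.values():
        n_i = len(values)
        mean_i = sum(values) / n_i
        ss_i = sum((y - mean_i) ** 2 for y in values)
        var_i = ss_i / (n_i - 1)
        weighted_sd += n_i * math.sqrt(var_i)
        tll += -0.5 * (ss_i / var_i + n_i * math.log(var_i))
        between += n_i * (mean_i - grand_mean) ** 2
        within += ss_i

    k = len(groups)
    f_ratio = (between / (k - 1)) / (within / (total_n - k))
    return {
        "weighted_sd": weighted_sd / total_n,
        "avg_log_likelihood": tll / total_n,
        "avg_f": f_ratio / total_n,
    }


# Hypothetical normalized measurements for two chromosome groups.
measures = evaluation_measures({"1": [4.2, 4.4, 4.3], "2": [3.0, 3.2]})
```

Running the same function on unnormalized and normalized versions of a data set gives the kind of side-by-side comparison the section describes: a smaller weighted standard deviation and larger average log likelihood and average F indicate better normalization.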
COMPARISONS ON ACTUAL DATA
Blood samples were obtained from four males who had no known medical defects. Metaphase chromosomes were prepared, identified by banding, and restained and scanned by CYDAC (see Mendelsohn et al. [4,5]). The optical density was measured for each chromosome; these measurements are unnormalized measurements. The measurements in each complete cell (i.e. cells with useable measurements on all 44 autosomes) are then normalized by the autosome total for that cell (the “standard” method for normalizing measurements). The results here are called autosome normalized measurements. The unnormalized
data also are normalized by the maximum likelihood (ML) procedure, described earlier, and given the name ML normalized measurements.

Summary statistics for the three types of measurements are shown in Table 1. First, with either method of normalization, note the large reduction in cell-to-cell variability. For these data, the average standard deviation over the four experiments is reduced by 50.4 per cent for normalization by autosome total and 51.6 per cent for normalization by the ML procedure. After normalization, the data more closely follow a normal distribution. This is shown by the increase in the average log likelihood in the table: this average increase is 49.6 per cent for autosome normalization and 49.0 per cent for ML normalization over the four experiments. Normalization also increases the ability to distinguish the chromosomes using only optical density measurements, as is shown by the large increase in the average F for separability. The average increase over the four data sets was 289 per cent for autosome normalization and 309 per cent for ML normalization. In addition, the ML procedure allows all 1177 chromosome measurements to be used, whereas for normalization by autosome totals, only 827 (70 per cent) of the sample is usable. Convergence of the ML procedure is rapid, with at most three iterations required for $C_j^{(k)}$ to be within 0.01 of 1.0.

Table 1. Summary statistics for unnormalized and normalized chromosome data

Subject               Total cells   Method†    N*      Weighted Avg. S.D.‡   Avg. LL   Avg. F
BHM                   6             Unnorm.    274     .119                  1.75      2.52
                                    Auto.      230     .085                  2.10      4.97
                                    M.L.       274     .082                  2.11      5.33
DHM                   6             Unnorm.    268     .173                  1.37      1.17
                                    Auto.      138     .069                  2.32      7.54
                                    M.L.       268     .072                  2.25      6.84
MLM                   5             Unnorm.    228     .188                  1.33      0.93
                                    Auto.      184     .097                  2.05      3.30
                                    M.L.       228     .096                  2.04      3.44
JM                    9             Unnorm.    407     .179                  1.34      1.11
                                    Auto.      275     .075                  2.19      6.44
                                    M.L.       407     .075                  2.16      6.73
Totals and weighted   26            Unnorm.    1177    .165                  1.44      1.42
averages                            Auto.      827     .082                  2.16      5.52
                                    M.L.       1177    .080                  2.15      5.79

* N = total number of usable chromosomes for each individual.
† Auto. = data normalized by autosome total; M.L. = data normalized by maximum likelihood.
‡ Unnormalized and normalized data are scaled so that the sum of the 44 autosome means is 100; this allows direct comparisons.

Table 2. Summary statistics for M.L. normalization

                                                                                   Differences in normalization
Subject               Total cells   Autosomes used   N      Avg. S.D.   Avg. LL   Avg. F   Avg.    Max.
BHM                   6             All              274    .082        2.11      5.33
                                    #1               11     .087        2.05      4.64     1.4%    [illegible]
DHM                   6             All              268    .072        2.25      6.84
                                    #1               12     .083        2.11      4.94     1.4%    3.4%
MLM                   5             All              228    .096        2.04      3.44
                                    #1               10     .109        1.87      2.83     1.7%    5.7%
JM                    9             All              407    .075        2.16      6.73
                                    #1               18     .082        2.09      5.60     1.5%    3.4%
Totals and weighted   26            All              1177   .080        2.15      5.79
averages                            #1               51     .089        2.04      4.69     1.5%

The summary statistics in Table 1 show that when all of the chromosomes are identifiable, the ML normalization procedure compares favorably with the usual method for normalization. To test the effectiveness of ML normalization when very few of the chromosomes are identifiable, normalization was accomplished using only information from the number 1 chromosomes. This simulates the performance of ML normalization when only the number 1 chromosomes are identifiable and used to determine the values for $C_j$. Once the $C_j$ are determined, measurements on all chromosomes can then be normalized and
summary statistics calculated. The results are shown in Table 2 for the same data as used for Table 1. As expected, the average standard deviation is increased (10.8 per cent) when normalization is based on fewer chromosomes. There are corresponding decreases in average log likelihood (5.1 per cent) and average F for separability (18.9 per cent). The last two columns of the table show the average and maximum differences in the computed values for $C_j$. On the average, the value for $C_j$ based on number 1 chromosomes alone differs from its value based on all 44 autosomes by less than 2 per cent. The changes shown in Table 2 are small and show that ML normalization is robust to changes in the number of chromosomes used in determining the normalization constants, $C_j$. Finally, to facilitate comparisons with data from other studies, Table 3 shows means and standard deviations for our optical density measurements.

Table 3. Mean and standard deviations for M.L. normalized optical density measurements

[Table body: Mean and S.D. columns for subjects BHM, DHM, MLM and JM and for the pooled data, one row per chromosome (1-22, X, Y). The individual values are too scrambled in this copy to be realigned.]

SUMMARY

A new procedure for minimizing cell-to-cell variation in chromosome measurements is derived and is tested on samples from four normal male subjects. Three statistics are suggested for evaluating the effectiveness of normalization procedures and are used to show that the new procedure is superior to the usual method (dividing each measurement by the total for the cell from which it arose). The new procedure is iterative and can easily be used to normalize data from incomplete and/or abnormal cells. It can be applied to any of the measurements now being made on chromosomes such as length, area and DNA stain content.

Acknowledgements-This work was performed under the auspices of the U.S. Atomic Energy Commission and USPHS Grant 7 R01 GM 20291.
REFERENCES

1. R. S. Ledley, H. A. Lubs and F. H. Ruddle, Introduction to automatic chromosome analysis, Comput. Biol. Med. 2, 107-128 (1972).
2. C. J. Hilditch and D. Rutovitz, Normalization of chromosome measurements, Comput. Biol. Med. 2, 167-179 (1972).
3. P. W. Neurath, B. Kess and D. A. Low, Individualized human karyotyping through quantitative analysis, Comput. Biol. Med. 2, 181-193 (1972).
4. M. L. Mendelsohn and B. H. Mayall, Chromosome identification by image analysis and quantitative cytochemistry, in Human Chromosome Methodology, J. Yunis, Ed., Academic Press, New York (1973).
5. M. L. Mendelsohn, B. H. Mayall, E. Bogart, D. H. Moore II and B. H. Perry, DNA content and DNA-based centromeric index of the 24 human chromosomes, Science 179, 1126-1129 (1973).
ABOUT THE AUTHOR

DAN HOUSTON MOORE II earned his B.A. at the University of California, Santa Barbara in 1963. He received a Ph.D. in biostatistics from the University of California, Berkeley in September 1970. From 1970 to 1972 he worked as a research associate in the Department of Radiology at the University of Pennsylvania. While there he developed a statistical test for distinguishing between pairs of homologous chromosomes. He joined the staff as a biostatistician in the Biomedical Division of the Lawrence Livermore Laboratory in September 1972, where he is continuing work on the development and application of statistical methods to chromosome research.
APPENDIX

Under the assumption that the normalized measurements $Y_{ij} = C_j X_{ij}$ are normally and independently distributed with known means $\mu_i$ and known variances $\sigma_i^2$, the likelihood for $Y_{ij}$ is given by

$$L(Y_{ij}) = (2\pi\sigma_i^2)^{-1/2} \exp\left[-\tfrac{1}{2}(Y_{ij} - \mu_i)^2/\sigma_i^2\right]. \tag{A1}$$

If the observations are independent, the joint likelihood for all of the observations $Y_{ij}$ is given by

$$\prod_i \prod_j L(Y_{ij}). \tag{A2}$$

The log likelihood is obtained by taking the log of expression (A2) which, upon substitution of expression (A1) into (A2), is

$$\sum_i \sum_j \left\{ -\tfrac{1}{2}(Y_{ij} - \mu_i)^2/\sigma_i^2 - \tfrac{1}{2}\log(2\pi) - \log\sigma_i \right\}. \tag{A3}$$

Maximizing the likelihood is accomplished by minimizing the negative of expression (A3). Substitution of $C_j X_{ij}$ for $Y_{ij}$ in expression (A3) followed by differentiation with respect to $C_j$ and setting this derivative equal to zero leads to equation (2) of the text.
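For readers who want the omitted step, the differentiation can be sketched as follows. This is a reconstruction under the Appendix's own assumptions (known $\mu_i$ and $\sigma_i^2$, independent measurements), written out for a single cell $j$; it is not text from the original paper.

```latex
% Terms of the negative of (A3) that involve C_j, with Y_{ij} = C_j X_{ij}:
-\log L \;=\; \tfrac{1}{2}\sum_i \frac{(C_j X_{ij} - \mu_i)^2}{\sigma_i^2}
          \;+\; \text{terms free of } C_j

% Differentiate with respect to C_j and set the derivative to zero:
\frac{\partial(-\log L)}{\partial C_j}
  \;=\; \sum_i \frac{X_{ij}\,(C_j X_{ij} - \mu_i)}{\sigma_i^2} \;=\; 0

% Solve for C_j, which reproduces equation (2):
C_j \sum_i \frac{X_{ij}^2}{\sigma_i^2} \;=\; \sum_i \frac{X_{ij}\,\mu_i}{\sigma_i^2}
\quad\Longrightarrow\quad
C_j \;=\; \frac{\sum_i X_{ij}\,\mu_i/\sigma_i^2}{\sum_i (X_{ij}/\sigma_i)^2}
```

The first line is one half of the weighted sum $S$ of equation (1), which is why minimizing $S$ and maximizing the likelihood give the same normalizer.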