
Review papers

Design and analysis of reliability studies Graham Dunn Department of Biostatistics and Computing, Institute of Psychiatry, London

This review covers the design and analysis of essentially two types of reliability study: method comparison studies and generalizability (including inter-rater reliability) experiments. Likelihood-based methods of inference (confirmatory factor analysis and REML estimation of variance components, for example) are advocated, partly because of their ease of use but primarily as a stimulus to the use of more ambitious designs for the investigation of the quality of measurements. The more sophisticated approaches are not intended to replace the simpler traditional methods, however, but are expected to be used to supplement them.

1 Introduction

In an assessment of the potential role of modern computer-intensive statistical methods in the evaluation of measurement quality it is instructive to look at the way in which simpler methods reveal informative results. Modern practice should ideally be a blend of the older traditional methods and the newer, potentially more sophisticated, ways of looking at, and describing, data. In terms of routine clinical research it is perhaps better to stress the correct use of the simpler methods whilst, at the same time, pointing out their limitations. This appears to be the approach taken in the influential paper by Bland and Altman.1 On the other hand, many clinical research workers and all medical statisticians ought to be aware of new developments both in medical statistics and in areas of statistical research related to measurement in other disciplines (chemistry, engineering and the social and behavioural sciences, for example). The latter view is stressed in a recent textbook by the present author.2 In this article an attempt will be made to combine both lines of thought.

2 Historical background

Bibliographies on errors of clinical measurement and observer variation are provided by Fletcher and Oldham,3 Koran4 and Feinstein.5 Although measurement error was investigated by nineteenth-century physical scientists, the earliest clinical example provided by Fletcher and Oldham3 comes from neuropathology. Dunlop9 was concerned with sources of variation in the average number of cells per field

Address for correspondence: Dr G Dunn, Department of Biostatistics and Computing, Institute of Psychiatry, de Crespigny Park, Camberwell, London SE5 8AF, UK.

Downloaded from smm.sagepub.com at MCGILL UNIVERSITY LIBRARY on January 5, 2015


in the outer nerve cell layers in the first frontal convolutions from the post mortem brains of five control cases and eight cases of dementia praecox (schizophrenia). Three independent observers counted nerve cells in microscopic fields obtained for three nerve cell layers from each of the 13 subjects. Dunlop's results are reproduced in Table 1. A much more thorough examination of counting errors in haematology was undertaken by Berkson, Magath and Hurn.10 Having empirically demonstrated that variability between fields within a single specimen within a single counting chamber fails to follow the expected Poisson distribution (the ratio of the variance to the mean count for erythrocytes being 0.92 rather than the expected 1.0), these authors moved on to assess additional variability arising from pipetting errors and differences between counting chambers. The results are expressed in terms of variance components, although the authors do not explicitly use ANOVA models.

Later in this article the design and analysis of essentially two types of reliability study will be discussed. The first involves the comparison of two or more different measuring instruments (method comparison studies) and the second explicitly examines multiple sources of variability in measurements (generalizability and observer reliability studies). The distinction between the two is not always clear cut, but the dichotomy does provide a convenient framework for the exploration of different strategies in experimental design and statistical analysis.

An early paper by MacFarlane et al.11 describes an extraordinary experiment concerning both method comparisons and sources of observer variation in the determination of haemoglobin. Each of 16 observers made three separate estimations by each of 16 different laboratory methods in eight separate samples of blood - a total of 6144 measurements!
Differences between observers were considered likely to be associated with (a) sex, (b) training, (c) the side of the dominant eye, (d) abnormalities of colour vision, and (e) the effect of fatigue. Of the 16 observers taking part in the experiment, eight were men and eight were women. Four men and four women were trained laboratory workers. The remainder were relatively unfamiliar with the estimation of haemoglobin. In each of the four above subcategories two observers were right-eyed and two left-eyed. Finally, in the male group two observers were red-green colourblind. The effect of fatigue was considered to be a confounding factor. To allow for the influence of observer fatigue the order of the methods by which each observer made his measurements was varied by a regular system with each successive blood sample, so that all methods were used with equal frequency in all positions in the order of use.

Table 1   Average neuronal cell counts in schizophrenic brains9

Sadly, a full analysis of the results of the above experiment seems never to have been published. A tantalizing glimpse of what might have been done is provided in the opening sentence of the authors' discussion: 'The experiment was so planned, with the help of Prof. R.A. Fisher, that the various factors concerned in producing the over-all variability of the results of haemoglobin estimation by several observers could be assessed separately.' Today, such data might be summarized by estimates of variance components. Fisher12,13 had introduced the ideas of analysis of variance, and Tippett14 had extended the methods to explicitly estimate variance components. The historical development of variance components analysis is provided by Khuri and Sahai.15

Precision studies for clinical measurements are usually much less ambitious than the haemoglobin experiment described above. Nichols and Bailey,16 for example, describe a simple experiment in which four doctors were asked to independently measure both the right and left legs of each of 50 patients. The right-left difference was then calculated for each patient and the resulting 200 differences subjected to a two-way analysis of variance. Treating both patients and observers as random effects, components of variance were estimated for patients, observers and within-observer error.
At the opposite extreme to the above haemoglobin experiment in terms of design complexity is a series of studies in the 1940s and 1950s involving the statistician Jacob Yerushalmy. These are summarized in Yerushalmy.17 These studies are characterized by their thoroughness and, in particular, their large sample sizes. In the initial study, for example, 1256 14 x 17 inch cellulose acetate base films were independently interpreted by five clinicians.18 In a second study, 1800 70 mm photofluorograms were interpreted independently by six readers.19 The methods of analysis were not technically sophisticated but involved the use of simple descriptive measures such as the relative proportion of 'cases' detected by each of the observers together with counts of the number of agreements between pairs of raters or between duplicate assessments by the same observer. These studies illustrate what can be learnt from good quality data without the use of complicated statistical procedures.

Psychiatry is perhaps the speciality in which most work has been done on observer agreement. Much of this has involved analysis of data using kappa and weighted kappa, the chance-corrected agreement measures proposed by Cohen.20,21 See, for example, Spitzer et al.,22 Grove et al.23 and Shrout, Spitzer and Fleiss.24 Further discussion of kappa coefficients is provided by Kraemer in the present issue of this journal. Of more relevance to the present article is the influence of psychometric theory and correlational methods on reliability studies in psychiatry. The development of what is now known as classical test theory began with the work of Spearman25 and Brown26,27 at the beginning of the present century.
It was not until the 1960s, however, that these methods were routinely used in the evaluation of psychiatric rating scales - see, for example, Hamilton28 and Beck et al.29 A commentary on some of this work can be found in Dunn, Hand and Sham.30 An excellent review of psychometric methods of reliability estimation is provided by Feldt and Brennan.31 It is interesting that the psychometric methods used in psychiatric research have had very little influence on the methods used by research workers in other areas of medicine. It is perhaps even more puzzling why statistical work done by physical scientists has also gone unnoticed by clinical research workers, particularly in the context of clinical measurement. A few years ago the present author took part in a short training course for about 30 British clinical biochemists. In that course I asked the participants whether they were familiar with the British and International Standards on the evaluation of test methods.32,33 None of them appeared to be aware of their existence! In the field of method comparison studies, it would be equally interesting to know how many clinical chemists are aware of the work of Grubbs and his successors.34-36

3 A simple reliability study

Table 2 contains data from a simple reliability study conducted to illustrate the concepts of reliability and precision to a class of postgraduate students. Each of the 20 members of the class was asked to complete the 12-item General Health Questionnaire (GHQ)37 in the presence of the present author. They were given a duplicate questionnaire at the same time and were asked to complete this three days later, returning the completed questionnaire to the author by post. It is hoped (and, in general, often assumed!) that at the time of completing the second questionnaire the participants had forgotten their previous responses to the first questionnaire. A high GHQ score indicates psychological distress; a low score is indicative of a lack of psychological problems. Each item has four possible graded responses, here coded as 0, 1, 2 and 3, and to produce the odd and even subtotals in Table 2 the appropriate item grades are added. The total score is then the sum of the two subtotals. The maximum range for the total score is from 0 to 36.

Table 2   GHQ-12 scores: test-retest reliability experiment
(* Missing observations; L: late return of second questionnaire; Odd: sum of odd items; Even: sum of even items; GHQ: total score)

The results are typical of a real reliability study in many ways. First, the data are obtained for a very small sample of subjects - too small. This is not always the fault of the investigator, however. There may be situations in which one might wish to evaluate measurements on patients with, for example, a particularly rare congenital disease. There may be only 20 living cases available. Typically, however, there are many available subjects and the small samples arise from laziness, lack of resources or ignorance of the sampling behaviour of variance components or reliability coefficients. Another commonly occurring characteristic is the problem of missing values. Some subjects may have refused to take part or have been unavailable at the time of the study. Others fail to turn up for their second or subsequent examinations or interviews or, as in the present example, do not bother to return their completed questionnaires. If the reason for refusal to participate, or for failure to supply follow-up information, is related to the characteristics of the subjects, then this will be a source of bias in the results.

Returning to the problem of sample size, if the main concern was estimation of the variability of repeated measurements in the same subjects, then 20 subjects might not be too small a sample. In this case it might be preferable to try to get more repeated measurements per subject rather than struggle to find more people to test. These design issues will be discussed in more detail later (see, in particular, Section 5.3).

It was stated earlier that memory effects (or, in general, carryover effects) are often assumed to be absent. These result in correlated measurement errors, the presence of which can also bias precision estimates. If we simply compare the GHQ totals at the two times (looking at the standard deviation of the differences, for example) then intrasubject variability (lack of stability) will be confounded with measurement error (lack of precision). The former may, however, be an important source of variation in clinical data and, if we were simply to compare subtotals for the odd and even items obtained at a single time point, we would obtain information on lack of internal consistency but nothing on lack of stability. In designing a reliability study care must be taken in deciding what sources of variation need to be estimated, which might be confounded and which can be eliminated by the choice of experimental conditions.

Returning to the GHQ data in Table 2, the sum of the odd items together with the sum of the even items provide two alternative test scores or estimates of the subjects' state. They (the odd and the even items) could be considered as alternative forms of the test instrument and comparison of the two subtotals is analogous to a method comparison study (see next section). In the psychometric literature they are usually referred to as split-halves and the reliability of the total score (the sum of the split-half totals) is then obtained through the use of the Spearman-Brown prophecy formula.31 Instead of taking the odd items and comparing them with the even ones, we could have compared subtotals obtained for a random partition of the items into two equal groups of six. Better still, we could consider all possible partitions of the 12 items into two groups of six. This would lead us to the well-known alpha coefficient as a measure of internal consistency.31 This approach, however, would not be available for chemical, physiological or anatomical measurements and it will not be pursued here.

Consider the comparison of two sets of measurements made on a single sample of subjects through the use of two different measuring instruments, alternative test forms, split-halves or repeated measures using the same instrument. Too often the analysis
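The split-half step-up described above can be sketched in a few lines. This is a minimal illustration, not an analysis of the Table 2 data: the odd and even subtotals below are invented for the example.

```python
import numpy as np

def spearman_brown(r_half: float, k: float = 2.0) -> float:
    """Spearman-Brown prophecy formula: reliability of a test lengthened
    by a factor k, given the reliability (here the split-half
    correlation) of the shorter form."""
    return k * r_half / (1.0 + (k - 1.0) * r_half)

# Illustrative split-half subtotals (NOT the GHQ data from Table 2):
odd = np.array([5.0, 8.0, 2.0, 11.0, 7.0, 3.0, 9.0, 6.0])
even = np.array([6.0, 7.0, 3.0, 10.0, 8.0, 2.0, 9.0, 5.0])

r_half = np.corrcoef(odd, even)[0, 1]   # correlation between the two halves
r_full = spearman_brown(r_half)         # stepped up to the full-length test
print(round(r_full, 3))
```

Note that the stepped-up value always exceeds the split-half correlation itself (for positive correlations), reflecting the gain in reliability from doubling the test length.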


of such data is naive or invalid or both. It is quite common in clinical psychology, for example, to use linear regression techniques to predict total scores from subtotal scores obtained from a sample of items arising from the same test administration. Despite many warnings, clinical chemists regularly use linear regression for the relative calibration of measuring instruments. Psychologists and psychiatrists are particularly fond of using the Pearson product-moment correlation as a measure of agreement between pairs of measurements, as are clinicians from other specialities. Linear regression is not appropriate when both measures are subject to error. Product-moment correlation coefficients are measures of association and are not suitable as indices of agreement.

On the assumption that we do not make the obvious mistakes in the analysis of a set of data, there are still several more subtle problems for the unwary or inexperienced data analyst. If, for example, variance components or covariance components models are used for the data in Table 2, how should we cope with missing observations? Elementary texts hardly ever discuss variance components estimation, let alone that for unbalanced data. Which facets of the measurement process (odd versus even items, time 1 versus time 2, subject identity) do we regard as fixed, and which random? Many of these problems will be illustrated in the following sections.

Before moving on to the discussion of method comparison studies, however, we will briefly return to the discussion of the GHQ data. Considering only the total scores at the two times, and assuming that there has been no systematic change between the two times, we can fit a variance components model to estimate variances for subject differences and residual measurement errors.
Estimates obtained from the analysis of measurements for the nine subjects for whom there are complete data (having discarded the three late questionnaires on the suspicion that they had not been completed on the correct day) are 23.61 (se 12.78) and 3.78 (se 1.78) for the variance components due to subjects and within-subject error, respectively. The corresponding estimates based on all valid observations (20 at time 1 plus the 9 at time 2) are 26.81 (se 9.73) and 3.79 (se 1.78), respectively. In both analyses the estimates were based on the REML (residual maximum likelihood) criterion using the REML program package.38 Note that although the error variances are hardly affected by the choice of data set to analyse, the subject component is estimated with more precision in the second analysis. If we define a reliability estimate as the ratio of the subject variance to the sum of the two variances (total variance), that is

R = σS²/(σS² + σE²)     (1)

then it should be clear that the precision of this estimate will be better in the case in which we use all available information. Here σS² and σE² are the estimates of the subject and error variance components. The corresponding reliability estimate will, of course, be familiar to readers as an estimate of intraclass correlation. In the following discussion, reliability coefficients will be considered to be of secondary importance, the primary statistics being variance components and coefficients of precision. One reason for the relegation of the reliability coefficient should be clear from examination of Equation (1). It is not a fixed characteristic of a test or measuring instrument - it changes with the population of subjects being sampled. Again this is a point that is very rarely appreciated by clinicians. It is not always a problem, however, as it might be very useful as an indicator of how one particular measuring instrument will perform in a particular clinical or epidemiological
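The reliability ratio just defined is a one-line computation once the variance components have been estimated. The REML fitting itself is not reproduced here; the sketch below simply plugs in the two pairs of component estimates quoted in the text.

```python
def reliability(var_subjects: float, var_error: float) -> float:
    """Intraclass correlation: subject variance over total variance."""
    return var_subjects / (var_subjects + var_error)

# REML estimates quoted in the text, all valid observations:
print(round(reliability(26.81, 3.79), 2))   # roughly 0.88
# REML estimates from the complete-cases analysis:
print(round(reliability(23.61, 3.78), 2))   # roughly 0.86
```

The two reliability estimates barely differ; as the text notes, the gain from using all available information shows up in the precision of the subject variance component, not in the point estimate itself.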


setting. The same is true of kappa coefficients; kappa coefficients and intraclass correlations being equivalent summary statistics for many inter-rater reliability studies.2,39

4 Method comparison studies

4.1 Introduction

In a method comparison study one is primarily interested in assessing the relative performance of two or more different measurement techniques whilst, at the same time, trying to control extraneous or confounding sources of measurement variability. The variability observed in the measurements is inferred to come either from systematic relative biases arising from the use of one particular technique as opposed to another, or from lack of precision. For the time being, we will interpret the word 'precision' as meaning the inverse of the variance or of the standard error of the randomly fluctuating component of measurements made by a given technique. Depending on the context, the different measurement methods may correspond to different analytical methods, different analysts, technicians, observers or raters, different laboratories or clinics, or different instruments (including interviews or questionnaires) or pieces of equipment. In his monograph on method comparison studies Jaech36 has described the following experimental situation:

A number of items (n) are each measured once for the same characteristic by each of N experimental methods. The value of this characteristic for a given item does not change during the experiment. Thus, the data consist of nN observations.

This, then, describes the basic method comparison study. It will be the starting point for the discussion which follows, but the idea of the study will be expanded where necessary to incorporate replicated measurements, lack of balance (missing values either by accident or design) and the possibility of temporal fluctuations in the characteristic being measured (as in a longitudinal panel study). Missing values do not appear to be a problem in Jaech's laboratories and, although he acknowledges that, for some methods at least, replicate measurements may be performed, only the mean of the replicates for each item is reported and used in the statistical analysis. This is unnecessarily restrictive and may be an inefficient use of the available data.

Before moving on to a detailed discussion of statistical methodology, it is important to consider the method of selection of the items or subjects whose characteristics are to be measured. Jaech36 distinguishes two alternative sampling strategies. In the first, the random model, items selected for measurement are drawn randomly from a population in which the characteristic is often assumed to be normally distributed. In the second, the fixed model, the items are selected in a systematic way to span some required range. In the latter case, for example, one might choose to take samples from discrete diagnostic groups together with one from the population of 'normal' controls. Unless it is otherwise stated, the discussion below will be concerned with measurements assumed to have been made on a simple random sample of items or subjects. In many practical situations, however, this assumption is clearly unwarranted!


4.2 The basic measurement model

Following Jaech,36 let Xik represent an observed measurement value for item (subject) k using method i, where i = 1, 2, ..., N and k = 1, 2, ..., n. Let μk be the true, but unknown, value for item k. The statistical model relating Xik to μk has the following form:

Xik = αi + βiμk + εik     (2)

where αi and βi are parameters which jointly describe the measurement bias for method i, and where εik is the random error of measurement arising from this particular use of method i on item k. It will be assumed for the present that εik is normally distributed with mean zero and fixed variance σi² (this assumption will be relaxed later to allow for covariation of σi² and μk). That is,

E(εik) = 0  and  Var(εik) = σi²

It is also assumed that the measurement errors are uncorrelated with the true values and with each other, that is

Cov(εik, μk) = 0  and  Cov(εik, εjk) = 0  for i ≠ j

In the psychometric literature this is described as the congeneric tests model.40 It is also a specific example of the single common factor model, first proposed by Charles Spearman in 1904.41 From the above assumptions, it follows that, for a population of items and a given method (i),

E(Xik) = αi + βiμ

and

Var(Xik) = βi²σμ² + σi²

where μ is the expected value (over items) of μk, and σμ² is the population variance of the true value, μk.
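The moment structure above can be checked by a small simulation under the model. All parameter values below are arbitrary illustrations, not estimates from any data set discussed in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                            # number of items (large, to settle the moments)
mu, sigma_mu = 10.0, 2.0               # population mean and sd of the true values
alpha, beta, sigma_e = 1.5, 0.8, 0.5   # bias parameters and error sd for one method

true = rng.normal(mu, sigma_mu, n)               # the unknown true values
x = alpha + beta * true + rng.normal(0.0, sigma_e, n)   # observed measurements

# Compare empirical moments with E(X) = alpha + beta*mu = 9.5
# and Var(X) = beta^2 * sigma_mu^2 + sigma_e^2 = 0.64*4 + 0.25 = 2.81
print(x.mean(), x.var())
```

With 200 000 simulated items the empirical mean and variance should sit within a few hundredths of the theoretical values; reducing n makes the Monte Carlo error visible.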

In a method comparison study one is basically assuming the truth of a measurement model such as that described in Equation (2) and then comparing observed summary statistics (means, variances and covariances) to their expected values in order to make inferences concerning the values of the model parameters. Some authors, however, may be less ambitious than this and may simply wish to know how well the results of using one measurement technique agree with the results from another. If, however, it is our aim to infer parameter values then it is important to first consider their meaning and interpretation.

If αi = 0 and βi = 1, then method i is said to be unbiased. If αi does not equal 0 but βi = 1, then method i is said to have a constant bias. For two methods i and j, if αi is not equal to αj, but βi = βj = 1, then the two methods have a constant relative bias measured by the difference αi - αj. Returning to method i, if βi does not equal 1, then method i has a nonconstant bias (that is, the bias either increases or decreases with increasing values of μk). Finally, for two methods, i and j, if βi does not equal βj then these two methods are said to have a nonconstant relative bias measured by the ratio βi/βj (or its reciprocal).

There is a problem with these definitions, however. As long as we never have access to the truth (that is, the μk values) we can never estimate either the αs or the βs. We can only estimate their differences or ratios, respectively. And it is important to recognize that we can never, in fact, have access to the truth in the platonic sense. The problem can be solved in several alternative (and mathematically equivalent) ways. Here we will consider two. Psychometricians usually set the scale of measurement by imposing the two constraints E(μk) = 0 and σμ² = 1. Alternatively, it may be more attractive to regard one of the measurement methods as a standard (method 1, say) and let α1 = 0 and β1 = 1. The other αs and βs now measure systematic biases with respect to the standard. In the following discussion, the latter constraints will be used.

Now consider the precisions of the different measurement methods. Jaech36 states that the methods have common precisions if σi is the same for all i, where σi is the standard error of measurement for method i; that is, methods i and j have the same precision if σi = σj. The precision of a method, of course, has an inverse relationship to σi². Now, equating precision with, say, 1/σi², is unsatisfactory as it does not take into account the possibility of differing scales of measurement being used in the differing measuring instruments (that is, differing βs).
It would be unsatisfactory, for example, for the comparison of the precisions of two thermometers, one calibrated in degrees Fahrenheit, and the other in degrees Centigrade. Cochran,42 quoting an unpublished manuscript of Frederick Mosteller, implies that the precision of an instrument should be expressed as the ratio

βi²/σi²

This is the definition used by Shyr and Gleser.43 Theobald and Mallinson,44 on the other hand, use the term relative precision for the equivalent quantity

βi²σμ²/σi²

If we wish to compare an instrument or method, say i, with the standard method then, following the terminology of Shyr and Gleser,43 we have

precision of method i relative to the standard = (βi²/σi²)/(1/σ1²) = βi²σ1²/σi²

This is independent of σμ². Finally, let us consider what is meant by the term reliability in the context of the measurement model described by Equation (2) and the accompanying assumptions. In the psychometricians' classical test theory, we have


Xik = τik + εik

where τik is simply the expected value of Xik over replications for fixed method i and item k; under Equation (2), τik = αi + βiμk. Here

Var(τik) = βi²σμ²

Reliability is defined as

Ri = βi²σμ²/(βi²σμ² + σi²)

Note that it is not a fixed characteristic of a method; it is dependent on the variability of the items and on the standard error of measurement of the method. It can be interpreted as the proportion of the variability of the observed measurements that is explained by the error-free component of variability in the items being measured.

Now, Equation (2) (the congeneric tests model) appears to be quite a satisfactory starting point for the description of method comparison data. It is, however, often an oversimplification in at least two ways. First, it is quite often the case that the error variances (the σi²s) are not independent of the characteristics (the μks) being measured. This might be taken into account by a prior variance-stabilizing transformation or, alternatively, it may be explicitly incorporated into the statistical model and resulting inferential methods. This will be discussed later. The second complication arises from the fact that the item or subject being measured almost invariably has some property or characteristic, other than that being measured, which 'interferes' with the measurement process. The amount of interference will be characteristic of the measuring instrument or method under consideration. This interference or non-specificity is often referred to as an item-specific bias. It appears to have been first discussed by Fairfield Smith45 and is well documented in the literature of analytical chemistry46 and clinical chemistry.47,48 It was discussed by Cochran,42 again referring to an unpublished manuscript on an inter-rater reliability study by Mosteller. In the present author's experience it is certainly not simply a problem that is solely a characteristic of chemical measurements.
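The precision and reliability quantities defined above can be collected into a small sketch. The parameter values are arbitrary illustrations; the point is that the relative precision ratio is free of the item-population variance σμ², whereas reliability is not.

```python
def precision(beta: float, sigma: float) -> float:
    """Mosteller/Cochran precision ratio for one method: beta^2 / sigma^2."""
    return beta ** 2 / sigma ** 2

def relative_precision(beta_i: float, sigma_i: float, sigma_1: float) -> float:
    """Precision of method i relative to the standard (alpha_1 = 0, beta_1 = 1).
    The population variance sigma_mu^2 cancels out of this ratio."""
    return precision(beta_i, sigma_i) / precision(1.0, sigma_1)

def reliability_congeneric(beta: float, sigma: float, sigma_mu: float) -> float:
    """R_i = beta^2 sigma_mu^2 / (beta^2 sigma_mu^2 + sigma^2): unlike the
    precision ratio, this depends on the item population through sigma_mu."""
    signal = beta ** 2 * sigma_mu ** 2
    return signal / (signal + sigma ** 2)

# A method reporting on a doubled scale (beta = 2), with its error sd
# doubled accordingly, has the same relative precision as the standard:
print(relative_precision(2.0, 1.0, 0.5))                       # 1.0
# The same method looks more 'reliable' in a more heterogeneous population:
print(reliability_congeneric(1.0, 0.5, 1.0),
      reliability_congeneric(1.0, 0.5, 2.0))
```

The first print illustrates the thermometer point: a pure change of scale does not alter precision once the βs are taken into account. The second illustrates why the article treats reliability coefficients as secondary statistics.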
It appears to be displayed in the lung-function data of Barnett,49,50 which have also been analysed by other authors.51,52 It is also illustrated by both the CAT scan and EEG data discussed by Dunn.2 These CAT scan data will be used for illustrative purposes in the next section. The analysis of several sets of EEG data is given in Besag et al.53 In the context of ANOVA models for interviewer or rater reliability studies, the above item-specific biases are equivalent to interviewer-subject or rater-subject interactions.54,55 A more realistic statistical model is therefore given by

Xijk = αi + βiμk + δik + εijk

where Xijk represents the jth replicate on the kth item or subject using the ith measuring device or method. The term δik is that component of the measurement error which is common to all replicates for subject k using method i (i.e. the item-specific bias). The term εijk is the residual error term which is unique to that particular measurement. The


other terms are the same as for the model described by Equation (2). The variance of the δs is σδi², whilst that of the εs is σεi². The effective precision of method i is now

βi²/(σδi² + σεi²)

in the case of a single replicate, or

βi²/(σδi² + σεi²/m)

in the case of the mean of m replicates. Note that in the absence of replication, the two error terms, δ and ε, will always be confounded. In the following discussion it will usually be assumed that both σδi² and σεi² will be constant over all the values of μk under consideration. Where this assumption is relaxed, it will be stated explicitly.
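The effective-precision calculation can be sketched directly; the variance values below are arbitrary illustrations, chosen only to show that replication improves precision with diminishing returns.

```python
def effective_precision(beta: float, var_item_bias: float,
                        var_residual: float, m: int = 1) -> float:
    """Effective precision when the mean of m replicates is used:
    beta^2 / (sigma_delta^2 + sigma_eps^2 / m).
    Averaging shrinks only the residual term; the item-specific bias
    component is common to all replicates and does not average out."""
    return beta ** 2 / (var_item_bias + var_residual / m)

p1 = effective_precision(1.0, 0.2, 0.6, m=1)    # 1 / 0.8  = 1.25
p3 = effective_precision(1.0, 0.2, 0.6, m=3)    # 1 / 0.4  = 2.5
p100 = effective_precision(1.0, 0.2, 0.6, m=100)
print(p1, p3, p100)
```

Note the ceiling: however many replicates are taken, the precision cannot exceed βi²/σδi² (here 5.0), because the item-specific bias never averages out.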

4.3 Comparison of two measurement methods: inference

In most of the following discussion we will be concerned with the comparison of the performance of just two instruments or techniques. Usually we will be concerned with the comparison of a new method with an already well-established or standard method. Table 3 provides measurements derived from computerized axial tomographic scans (CAT scans) of the heads of 50 psychiatric patients.2,56 The primary aim of these scans was to determine the size of the brain ventricle relative to the patient's skull: the ventricle-brain ratio, or VBR, equal to (ventricle size/brain size) x 100. For a given scan or 'slice' the VBR was determined from measurements of the perimeter of the patient's ventricle together with the perimeter of the inner surface of the skull. The measurements were made using (a) a hand-held planimeter on a projection of the X-ray image, or (b) an automated pixel count based on the image displayed on a video-display screen. Table 3 gives the logarithms of the VBRs for single scans from 50 patients. The first two columns correspond to repeated determinations based on pixel counts and the second two columns correspond to repeated determinations based on the use of the planimeter. Here the planimeter can be regarded as the standard method whilst the pixel count is the new one.

Table 3   CAT scan data: logged VBRs from 50 patients2

Comparable data on the measurement of peak expiratory flow rate (PEFR) are given by Bland and Altman.1 Here two measurements were made on each patient using a Wright peak flow meter and two with a mini Wright meter. Strike48 presents duplicate assay results associated with the comparison of two methods of measurement of gentamicin. Each of 56 serum specimens was assayed twice by an existing EMIT method and twice by a new FIA method. Returning to Table 3, we can begin the analysis of the results by first investigating the repeatability of each of the two methods separately. The data for the first round of measurements can be plotted against the second ones and, following the recommendations of Bland and Altman,1 the difference between the two can be plotted against their mean. These graphical methods can be extremely useful in (a) allowing us to look for any systematic biases, (b) checking whether the variability (precision) of the methods appears to be related to the size of the characteristic being measured, and (c) looking for outliers, or what Healy57 has referred to as 'blunders'. The variances of the measurement errors can be estimated by halving the variance of the above differences or through the use of two separate one-way analyses of variance. In the latter case the within-subject mean squares provide the required estimates, whilst the corresponding reliabilities are provided by the following estimates of intraclass correlation:

R̂ = (BMS − WMS)/(BMS + WMS)

where BMS is the between-subjects mean square and WMS the corresponding within-subjects mean square. This familiar expression is derived, for example, in Dunn2 and Fleiss.58 The results are given in Table 4. It appears that the pixel method is much more precise than the use of planimetry. Note, however, that the pixel measures appear, from looking at the total and between-subjects sums of squares, to be much more variable than the planimetry measurements.

Table 4 CAT scan data: analysis of variance

We now move on to compare the means of the two pixel measures with the corresponding means for the planimetry scores. Again the graphical methods of Bland and Altman can be used. The results are displayed in Figure 1. Note that if these means were the only observations available, then it would be impossible to tell which instrument is the more precise from looking at this plot. Nor would it be possible to estimate the slope of the line relating the true scores for the two methods in the absence of assumptions concerning their relative precisions. This is an example of the well-known linear regression problem in which there are known to be errors in both variables.59 Here we will proceed by assuming that the relative bias for the two methods is constant (that is, the βs are equal). This allows us to estimate the two error variances using Grubbs'34,35 method. This is based on simply equating the variances for the pixel and planimetry measures, together with their covariance, to their respective expected values. If X represents the mean of the two planimetry measures, Y the mean of the corresponding pixel measures, and μ their expected value, then

Figure 1 Diagnostic plot of logged VBR measures (Table 3)

E(s²_X) = σ² + σ²_εX, E(s²_Y) = σ² + σ²_εY

and

E(s_XY) = σ²

The required estimates are then

σ̂²_εX = s²_X − s_XY and σ̂²_εY = s²_Y − s_XY

where s²_X and s²_Y are the sample (observed) variances for X and Y, respectively, and s_XY is the corresponding covariance. Note that the reliabilities are estimated by s_XY/s²_X and s_XY/s²_Y, respectively. If the sample moments are calculated using n as the denominator (rather than the usual n − 1) then Grubbs' estimators are also maximum likelihood estimates. For the present data

σ̂²_εX = 0.1575 − 0.1459 = 0.0116 and σ̂²_εY = 0.2732 − 0.1459 = 0.1273

As these are variances of the measurement errors for the means of two observations, one would expect them to be roughly half the values of the corresponding mean squares in Table 4. This is not too far from the truth in the case of the planimetry measures but a long way from it in the case of the pixel counts. They now appear to be much less precise. Before discussing this problem further, however, first consider a significance test for the equality of σ²_εX and σ²_εY. Let U = X + Y and V = X − Y. Then it is straightforward to show60 that

Cov(U, V) = σ²_εX − σ²_εY

This is equal to zero if, and only if, σ²_εX = σ²_εY. One therefore constructs a test statistic

t = r_UV √(n − 2)/√(1 − r²_UV)          (28)

where r_UV is the observed correlation between U and V. Under the null hypothesis (σ²_εX = σ²_εY) t is distributed as Student's t with n − 2 degrees of freedom.60 There is no need to calculate U and V explicitly, but note that

r_UV = s_UV/(s_U s_V)

where

s_UV = s²_X − s²_Y, s²_U = s²_X + s²_Y + 2s_XY

and

s²_V = s²_X + s²_Y − 2s_XY

In the present example, s²_X = 0.1575, s²_Y = 0.2732 and s_XY = 0.1459. From these, r_UV = −0.37 and t = −2.72. The pixel measure appears to be less precise than planimetry! Returning to the error variance estimates based on Table 4, together with those based on Grubbs' estimates, one obvious explanation is the presence of item-specific bias in the pixel measures. The variance of this bias can be estimated by subtracting half of the error variance for the pixel measure in Table 4 from the estimate of σ²_εY. That is, the estimate of the variance of the item-specific bias is given by

σ̂²_δ = σ̂²_εY − WMS(pixel)/2

The corresponding estimate for the planimetry measure is

σ̂²_δ = σ̂²_εX − WMS(planimetry)/2

As this is negative, we assume that this is indicative of the absence of item-specific biases in the planimetry measures. Returning to Figure 1, we might have been tempted to estimate β for the following model:

X = τ + ε_X, Y = α + βτ + ε_Y

This is possible if we are prepared to assume that we know λ = σ²_εX/σ²_εY. We can estimate λ from the residual mean squares given in Table 4. For a given λ the maximum likelihood estimate of β is given by61

β̂ = [s²_Y − λs²_X + √{(s²_Y − λs²_X)² + 4λs²_XY}]/(2s_XY)          (33)

Here β̂ = 1.83. The above estimate was first derived by Kummell62 but is usually attributed to Deming.63 If it is known that λ is not fixed but varies with the size of the characteristic being measured, then the methods proposed by Nix and Dunstan64 are more



appropriate. In the presence of item-specific biases we still have problems! Here the estimator given in Equation (33) is unsafe. Now let us return to the raw data and start again. To a behavioural statistician the obvious way to look at this sort of data is through the use of confirmatory factor analysis (CFA). We simply make assumptions concerning the generation of the data and then proceed to fit a series of CFA models - usually using maximum likelihood, although other, distribution-free, fitting criteria are available (see the article by Bentler and Stein in the current issue of this journal). On the basis of the preliminary investigation of the data in Table 3 we assume the following measurement model:

Y1 = α + βτ + δ + ε1
Y2 = α + βτ + δ + ε2
X1 = τ + ε3
X2 = τ + ε4          (34)

where Y1 and Y2 are the pixel measurements, X1 and X2 the planimetry measurements, and τ, the εs and δ are all mutually uncorrelated. We constrain σ²_ε1 = σ²_ε2 and, similarly, σ²_ε3 = σ²_ε4. The δ term represents the item-specific bias in the pixel measurements - its variance is σ²_δ. The parameters to be estimated are α, β, σ²_τ, σ²_ε1, σ²_ε3 and σ²_δ. If we choose to fit CFA models to the variance-covariance matrix for the four measurements then we can drop the α term from our model. The required covariance matrix is given in Table 5. Methods of fitting and constraining intercept terms in CFA models are described by Dunn2 and by Bentler and Stein in this issue. The results of fitting several CFA models are shown in Table 6. In all cases the models were fitted using EQS.65 Note that the model described by Equation (34) fits very well. The estimate of β is very close to 1 (unlike the inappropriate estimate produced by using Equation (33)). The model fits equally well when β is, in fact, constrained to be 1 (Table 6(b)). If we now constrain σ²_δ to be zero, however, leaving β free to be estimated, the resulting fit is very poor and the estimate of β is 1.86 (very close to the Deming estimate). Model (b) appears to be the best summary of the data. The advantage of fitting the CFA models is that they provide appropriate significance tests and also standard errors of the estimates. EQS can also be used for bootstrap resampling if confidence intervals for these estimates or functions of them (e.g. reliability or relative precisions) are required. CFA models can also be fitted simultaneously to two or more sets of data, and this may be a particularly attractive feature when different combinations of measurements are made on different groups of subjects.2,66 Now, before leaving the CAT scan data, we will fit one more CFA model. Having

Table 5 Observed covariance matrix for the logged VBR measurements

Table 6 Results of fitting CFA models to CAT scan data

constrained β to be 1, it is of interest to ask if the reliabilities (precisions) of the two methods are significantly different. This is equivalent to imposing the constraint σ²_ε1 + σ²_δ = σ²_ε3. The results are shown in Table 6(d). When compared to Table 6(b) there is a highly significant increase in chi-square. The pixel method is clearly less precise than the planimetry method. This analysis, however, has been included primarily as a further illustration of the potential of CFA modelling as opposed to the older, more ad hoc, methods. One should also think carefully about the effective use of the graphical methods illustrated by the plots in Figure 1. If the errors-in-both-variables regression model is unidentified one must conclude that graphical displays such as Figure 1 have less value than one might initially be tempted to believe. The plots in Figure 1 would certainly not lead one to fit Equation (34) to the data - particularly with the constraint β = 1.

4.4 Comparison of two measurement methods: design

First, we will briefly consider sample size requirements for the estimation of the standard error of measurement of a single measuring instrument or, alternatively, of two or more measuring instruments with the same standard error of measurement. If we have q replicate measurements made by a single instrument or method on a single item then, assuming normality, the approximate coefficient of variation of the standard error

Downloaded from smm.sagepub.com at MCGILL UNIVERSITY LIBRARY on January 5, 2015

140

of measurement estimated by the standard deviation of the replicates is 1/√{2(q − 1)}. Healy57 notes that, if we need a coefficient of variation of 10% for the standard error of measurement, we must have at least 50 degrees of freedom. If we have duplicate measurements on each of n subjects then again we would expect to need measurements on at least 50 subjects (the standard error of measurement being essentially based on n differences). Readers who are more concerned with power considerations in the estimation of the equivalent reliability coefficients are referred to papers by Donner and Eliasziw67 and by Eliasziw and Donner.68 The latter considers a method of determination of the number of subjects, n, and number of repeated measurements, q, that minimize the overall cost of conducting a reliability study, while providing acceptable power for tests of hypotheses concerning the reliability coefficient, ρ. Returning to method comparison studies, the simplest experiment is one in which each of n subjects is measured by two alternative methods. Typically, we are concerned with the estimation of the precisions of the two methods together with a test of the equality of their precisions. Let us assume that we are comparing two methods with no relative biases (that is, they only differ, if at all, in their standard errors of measurement). We could, however, relax these assumptions to allow for a fixed relative bias between the two methods without altering the conclusions given below. We also assume here that there are no item-specific biases. If we estimate the variances of the errors of the two methods (σ²_i for i = 1, 2) using Grubbs' moment estimators, then the approximate variances of the estimates are given

by34,36

Var(σ̂²_1) ≈ {2σ⁴_1 + σ²_1σ²_2 + σ²(σ²_1 + σ²_2)}/(n − 1)

where σ² is the variance of the true values of the items being measured. Note that this decreases as σ² decreases, its minimum value being

{2σ⁴_1 + σ²_1σ²_2}/(n − 1)

For a fixed value of σ² the variance of σ̂²_1, for example, will also decrease with the value of σ²_2. The minimum value (obtained when σ²_2 = 0 - that is, when method 2 is error-free) is now given by

{2σ⁴_1 + σ²σ²_1}/(n − 1)
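Grubbs' moment estimators, and the associated test of equal precisions based on the correlation between the sum and the difference of the paired measurements, are easy to compute; a minimal sketch (the simulated data and parameter values in the accompanying example are illustrative only):

```python
import numpy as np
from math import sqrt

def grubbs_test(x, y):
    """Grubbs' moment estimates of the two error variances, plus the
    t statistic (n - 2 df) based on the correlation between the sum
    and the difference of the paired measurements.
    """
    n = len(x)
    sxx, syy = np.var(x, ddof=1), np.var(y, ddof=1)
    sxy = np.cov(x, y)[0, 1]
    err_x, err_y = sxx - sxy, syy - sxy     # Grubbs' estimators
    r_uv = np.corrcoef(x + y, x - y)[0, 1]  # corr(U, V), U = X+Y, V = X-Y
    t = r_uv * sqrt(n - 2) / sqrt(1 - r_uv ** 2)
    return err_x, err_y, t
```

A markedly negative t indicates that the second method is the less precise of the two, and vice versa.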

But what about tests of the equality of σ²_1 and σ²_2? The power of these tests also depends on the relative size of σ²_1 and σ²_2 (as would be expected!) and on σ².60 Equation (28) gives a test statistic based on the correlation between U (the sum of the two measurements) and V (the difference between the two measurements). Considering the population value of this correlation, ρ_UV, we have60

ρ_UV = (σ²_1 − σ²_2)/√{(4σ² + σ²_1 + σ²_2)(σ²_1 + σ²_2)}

The power of the test will increase with |ρ_UV| which, in turn, increases as (σ²_1 − σ²_2) increases and as σ² decreases. For a given set of values (σ²_1, σ²_2 and σ²) we can get a corresponding ρ_UV and then use the tables of, for example, David69 to get the required power. For different values of n and ρ_UV (or some other function of σ²_1, σ²_2 and σ²) power curves can be generated. Here we will



simply illustrate the effect of σ² on the power of the test using a Monte Carlo simulation experiment. Table 7 shows the results of a series of simulation experiments in which the significance test described in Equation (28) was used to compare the error variances of two measuring instruments. Under each condition (value of σ²) a thousand sets of measurements were generated and the test statistic in Equation (28) calculated for each set. The power of the test was estimated by the proportion of test statistics which were significant at the 0.05 level (1-tailed). In all simulations σ²_1 = 16 and σ²_2 = 4. It can be seen that if there is considerable variability between the items being measured, then the power of the test (or experiment) is very low. As one would typically like a wide variation in the items measured in a study of this sort, this result suggests that the design of the study (that is, measurement without replication) is inadequate. Hahn and Nelson,70 recognizing the deficiency of the above design, showed that the situation could be remedied by replication with only one of the measurement methods. Let the measurement by the first method be represented by X and the two replicates using the second method by Y1 and Y2. Then, assuming constant relative bias, we have

X = μ + ε_X
Y1 = μ + ε_Y1
Y2 = μ + ε_Y2

with Var(μ) = σ², Var(ε_X) = σ²_εX, and Var(ε_Y1) = Var(ε_Y2) = σ²_εY. We assume that ε_X, ε_Y1 and ε_Y2 are all uncorrelated with each other. If we now calculate

A = X − (Y1 + Y2)/2 and D = Y1 − Y2

then σ²_εX is estimated by s²_A − s²_D/4 and σ²_εY by s²_D/2, where s²_A is the sample variance of A and s²_D = ΣD²_i/n.

Table 7 Results of simulated method comparison studies
* R1 and R2 are the reliabilities of Method 1 and Method 2, respectively.

A test statistic for the test of the hypothesis σ²_εX = σ²_εY is calculated as

F = 4s²_A/(3s²_D)

which is compared to an F-distribution with (n − 1) and n degrees of freedom.70 Table 7 illustrates the results of simulation experiments where σ² was varied, as before, and σ²_1 (σ²_εX) = 16 and σ²_2 (σ²_εY) = 4 in all cases. Notice how the power is no longer dependent on σ². For a typical method comparison study, this design is much more powerful than the simple one without replication. There are, however, pitfalls in the Hahn and Nelson design. The most important is the assumption of no item-specific biases. If one replicates both measurements (and produces the required estimates and tests using a CFA model) one can achieve slightly more power but, more importantly, one can test the validity of several of the required assumptions (lack of item-specific biases and of constant relative biases, for example). The pitfalls of fitting CFA models to data obtained from the equivalent of the Hahn and Nelson design are illustrated on pp 94-95 of Dunn.2 If we have replicate measurements using each of the methods more realistic measurement models can be fitted (see previous section). So far, we have assumed that the characteristic being measured is constant between one replication and the next. This is realistic for the CAT scan images discussed in the previous section but would not be appropriate where replicate measurements are taken from samples or observations on an individual over time. Healy,57 for example, has discussed biochemical and physiological changes in patients between one measurement occasion and another. Table 2 illustrates fluctuations in psychological distress (GHQ score) between an initial evaluation and a follow-up measurement four days later. This table also illustrates what sociologists would refer to as a two-wave, two-indicator panel study.2 The two indicators are the odd and even subtotals for the GHQ scores; the two waves correspond to the two measurement occasions.
In general one could take unreplicated measurements using two (or more) methods on each of k (usually equally-spaced) occasions. The result would then typically be analysed through fitting appropriate CFA or structural equation models (see Bentler and Stein in this issue). The use of longitudinal panel designs allows one to simultaneously estimate components of variability arising from individual (item or subject) differences, temporal fluctuations and, of course, measurement error. A monograph by Duncan-Jones et al.71 discusses in detail the approaches to the analysis of longitudinal data on minor psychiatric morbidity. The authors discuss results from three longitudinal studies, based in Canberra,72 Christchurch73 and Groningen,74 respectively. In the Canberra study a sample of 323 subjects from the general population was studied over a one-year period during which observations of minor psychiatric symptoms were obtained at regular four-monthly intervals. In the Christchurch study a sample of 1248 women with school-aged children was studied annually over a four-year period using measures of depression. Finally, in the Groningen study a sample of 258 respondents was studied over a nine-year period with observations being made in 1976, 1977 and 1984. At each occasion minor psychiatric symptoms were measured using two different measurement instruments. Examples from the sociological literature can be found in Wheaton et al.,75 Jagodzinski et al.76 and in Raffalovich and Bohrnstedt.77

4.5 More complex designs

We have seen that, in a simple method comparison study involving just two measuring instruments or methods, it is not possible to calibrate one method with respect to the other without fairly restrictive assumptions concerning the behaviour of the variances of the measurement errors of the two instruments. In Section 4.3 this problem was



overcome by replication of the measurements for each of the two instruments. In Section 4.4 we introduced the idea of a longitudinal panel survey or experiment. This also provides a solution to the problem. An alternative approach is to add one or more further measuring instruments to the method comparison study. Lewis et al.,66 for example, were primarily interested in the comparison of a lay interviewer (method 1) and a trained research psychiatrist (method 2) in measuring severity of psychiatric morbidity as assessed through the use of a semi-structured psychiatric interview (the Revised Clinical Interview Schedule, CIS-R). To enable the calibration of one type of interviewer against the other through the use of a CFA model these authors also made measurements using two self-completion questionnaires (the Hospital Anxiety and Depression Scale or HAD78 and the GHQ37). Each subject, therefore, provided four severity measurements. The use of interviews or questionnaires to measure the severity of psychiatric distress, or quality of life, and so on, is, however, more problematic than the above paragraph would suggest. The same problems might arise in the measurement of physiological states such as blood pressure and lung function. Usually the measurements cannot be made simultaneously and have therefore to be used in some given sequence. If it is suspected that there might be learning or fatigue during the experiment, or carryover (memory) from one measurement to the next, then attempts might be made to allow for these problems. This might lead to the use of a crossover design. Lewis et al.,66 in fact, used such a crossover design.
A much earlier example is provided by a study of disagreement between two observers in the detection of respiratory disease.79 In the experiment described by Lewis et al.66 subjects were allocated to a given sequence of two interviews but the two supplementary questionnaires were always given in the same order during the interval between the first and second interview. There were, in fact, three interviewers (two psychiatrists and one lay interviewer) but only two of the interviewers saw any given patient. For any two interviewers there were two groups of patients corresponding to the two possible orders. For the whole experiment there were six different groups of patients; two for each of the three combinations of interviewers. The results were analysed through fitting a CFA model to the data for the six groups simultaneously. A further example of a crossover design for a method comparison study is provided by Bassein et al.80 These authors used a 3-period crossover experiment to compare the performance of two automatic devices for the measurement of blood pressure with that of the standard mercury sphygmomanometer. The data were analysed using an analysis of variance to examine sources of variability both between and within patients. Consider a hypothetical set of measurements made in a clinical chemistry laboratory. There are four measurement procedures available: A, B, C and D. The first two, A and B, do not involve destruction of all of the material under test and can be made on all specimens of material. The second two, C and D, both involve the destruction of the specimen and each requires most of the available material to be used in the assay. A particular specimen can yield either C or D, but not both. A simple design for a method comparison study might involve getting measurements A and B on all specimens and then randomly allocating specimens to be processed to give either C or D. This design would yield data for two groups of specimens.
By fitting a CFA model simultaneously to these two groups the relative calibration of C and D can be undertaken, even though both C and D are never available for the same specimen.2 An analogous situation might arise in the case of a psychiatric assessment where, in this situation, A and B are the results



obtained from fairly dissimilar psychiatric screening questionnaires (the GHQ and HAD, for example), but C and D are results obtained through the use of two closely related but time-consuming psychiatric interviews. Here, for reasons of cost as well as potential memory or carryover effects (correlated measurement errors), it might be thought impossible to expose subjects to both of these interviews. In the study of Lewis et al.66 the subjects were in fact exposed to both interviewers. There was some suspicion of the validity or quality of the second interview, however, and one of the approaches to the analysis of the data was to completely discard the measurements provided by the second interview and to simultaneously fit a common CFA model to the resulting three groups of subjects (corresponding to the identity of the first interviewer). In the above design measurements are missing completely at random. Another design that can introduce measurements which are missing at random is the familiar balanced incomplete blocks design (BIBD).81 In the context of a method comparison study, there may be sufficient material provided by a biopsy specimen to enable, at most, three measurements to be made. In a comparison of four methods or instruments one might proceed with a balanced design involving four groups of specimens in which each group has observations missing for one of the four methods. A computer-simulated data set was used by Dunn2 to illustrate the use of CFA models in the analysis of data arising from such a BIBD. Examples of the use of analysis of variance for BIBD studies are provided by Fleiss8 and by Clare and Cairns.82
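For the four-method example just described, the balanced incomplete blocks layout is simply the collection of all three-method subsets; a small sketch confirming the balance conditions:

```python
from itertools import combinations

methods = ["A", "B", "C", "D"]
# Each group of specimens omits exactly one method, so the blocks are
# the 3-element subsets of the 4 methods.
blocks = list(combinations(methods, 3))
# Balance: every method appears in 3 of the 4 blocks, and every pair of
# methods appears together in 2 blocks - the defining BIBD conditions.
```

Specimens would then be randomized to the four blocks, each block being measured by its three methods only.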

5 Generalizability, precision and inter-rater reliability studies

5.1 Generalizability theory

In a generalizability study83,84 one sets out to systematically investigate the sources of variation of measurements. A clinical measurement is assumed to be a sample from a universe of admissible observations, characterized by one or more facets. For a given generalizability study one defines a universe of admissible observations by listing the measurement conditions for each of the facets of interest. A facet (or experimental factor) can include, for example, alternative measuring instruments or test forms, different occasions, different clinics or laboratories, or different clinicians or scientists within clinics or laboratories, respectively. The different levels of a facet (such as alternative clinicians or measuring instruments) are called measurement or test conditions. Consider, for example, the experiment on haemoglobin measurement described in Section 2.11 Here, MacFarlane et al. considered five facets: type of measuring instrument (16 levels), sex of observer (two levels), training of observer (two levels), side of dominant eye (two levels) and, finally, abnormalities of colour vision (two levels). They also realized that there might be an effect of fatigue but, in their experiment, the order of making the measurements was varied to allow for this instead of it being explicitly recognized as a facet of observation. These authors did not, of course, use the terminology of generalizability theorists but their aims were the same. The book by Cronbach et al.85 includes examples of generalizability studies concerning the use of clinical rating scales. A particularly well-known example is that by Gleser et al.,86 which examines sources of variation in measures of psychological distress in disaster survivors. Streiner and Cooper87 provide an introduction to the use of generalizability theory in the context of the development and use of health measurement scales.

Readers are also referred to the recent review by Feldt and Brennan31 together with the primer



by Shavelson and Webb.88 Here, instead of repeating much of this material, we will concentrate on variance components estimation. Before moving to the next section, however, it may be of use to clarify why the concept of a universe of admissible observations might be of interest. A test score or measurement on which a clinical decision is to be based is only one of many measurements which would adequately serve the same purpose. 'The decision maker is almost never interested in the response given to the particular stimulus objects or questions, to the particular tester at the particular moment of testing. Some, at least, of these conditions of measurement could be altered without making the score any less acceptable to the decision maker'85 (p.15). A clinical chemist, for example, changes his batches of reagents, may move from one measuring instrument to another, or even change laboratories, without threatening the validity of the resulting measurement.
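The universe of admissible observations in the haemoglobin example is simply the Cartesian product of the facet levels; a small sketch (the level labels themselves are invented for illustration, only the numbers of levels come from the text):

```python
from itertools import product

# Facets of the haemoglobin experiment, with the numbers of levels
# reported in the text (the labels are hypothetical).
facets = {
    "instrument": range(16),
    "observer_sex": ("male", "female"),
    "observer_trained": (True, False),
    "dominant_eye": ("left", "right"),
    "colour_vision_abnormal": (True, False),
}
# The universe of admissible observations is the Cartesian product of
# the facet levels: 16 * 2 * 2 * 2 * 2 = 256 measurement conditions.
conditions = list(product(*facets.values()))
```

Restricting the universe (a 'decision study') simply means deleting levels from one or more of these facets before forming the product.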

'That is to say, there is a universe of observations, any of which would have yielded a usable basis for the decision. The ideal datum on which to base the decision would be something like the person's mean score over all acceptable observations, which we shall call his "universe score." The investigator uses the observed score or some function of it as if it were the universe score. That is, he generalizes from sample to universe. The question of "reliability" thus resolves into a question of accuracy of generalization, or generalizability'85 (p.15).

In any given investigation in which a measurement method is used, however, the decision maker may choose to restrict the universe of admissible observations. Considering the measurements of haemoglobin in the study designed with the help of RA Fisher,11 subsequent investigators might choose to use a single analytical method under the control of trained observers known not to be red-green colour blind. This would still allow for variability in the observations due to the sex of the observer and to the side of his or her dominant eye (and, possibly, fatigue). In this 'decision study' the generalizability of the measurements could still be inferred from the results of the earlier generalizability experiment. If the decision maker were to choose a different set of measurement conditions then the generalizability of the resulting measurements would change accordingly. Generalizability is 'simply' inferred from the appropriate combination of selected variance components estimated from the earlier study. The precision of these variance components (or lack of it!), however, has been described as the Achilles heel of generalizability theory.89
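The 'decision study' calculation amounts to recombining variance components chosen to match the restricted universe; a minimal sketch for a subjects-by-raters design (the component values in the accompanying example, and the decision to average over raters, are hypothetical):

```python
def generalizability(var_subject, var_rater, var_error, n_raters=1):
    """Generalizability coefficient for the mean of n_raters ratings.

    This is the 'absolute' version, in which the rater main effect
    counts as error; drop var_rater from rel_error for the
    norm-referenced (relative) version.
    """
    rel_error = (var_rater + var_error) / n_raters
    return var_subject / (var_subject + rel_error)
```

Increasing the number of raters averaged over shrinks the error term and so raises the coefficient, exactly as averaging replicates raises reliability in the method comparison setting.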

5.2 Estimation of variance components: inference

As this is such a relatively familiar area to many applied statisticians the discussion of variance components estimation will be kept relatively brief. Typically, some facets of the measurement process will be regarded as having fixed levels, whilst others will be regarded as having been selected at random from a population of possible levels. This gives rise to the concept of the general mixed (analysis of variance) model. Searle90 provides an up-to-date survey of modern estimation methods for the mixed model. These include the traditional ANOVA methods which involve equating mean squares to their expected values. ANOVA methods appear to be the ones preferred by the generalizability theorists,88 but they do not provide a very attractive option when considering unbalanced data. Other methods include Henderson's adaptations of ANOVA methods,91 minimum norm quadratic unbiased estimation (MINQUE),92,93 maximum



likelihood (ML)94 and restricted or residual maximum likelihood (REML).95,96 The latter method involves maximizing the joint likelihood of a set of orthogonal contrasts of the measurements, with all of the contrasts having zero expectation (error contrasts). That is, it involves the maximization of that part of the likelihood not involving any fixed effects. Readers are referred to Robinson97 and Searle90 for further details. The advantage of ML or REML estimation methods is that one easily obtains standard errors of the estimates as part of the maximization process. In the case of balanced data, REML and ANOVA estimates are identical (both, in this case, being unbiased). Two further estimation methods are minimum variance quadratic unbiased estimation (MIVQUE)98 and iterative generalized least squares (IGLS).99,100 Harville94 provides a detailed review of the properties, advantages and disadvantages of the two likelihood-based methods, and Swallow and Monahan101 summarize Monte Carlo investigations to compare the properties of ANOVA, MIVQUE, REML and ML estimates of variance components. Bayesians are referred to a paper by Lee102 and to the review by Khuri and Sahai.15
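For balanced data the traditional ANOVA route (which REML then reproduces) simply equates the observed mean squares to their expectations; a one-way random-effects sketch with q replicates on each of n subjects:

```python
import numpy as np

def anova_components(y):
    """ANOVA variance-component estimates for a balanced one-way random
    model; y is an (n, q) array of q replicates on each of n subjects.
    Equates mean squares to expectations:
    E(BMS) = q*var_between + var_error, E(WMS) = var_error.
    """
    n, q = y.shape
    means = y.mean(axis=1)
    bms = q * np.sum((means - y.mean()) ** 2) / (n - 1)
    wms = np.sum((y - means[:, None]) ** 2) / (n * (q - 1))
    return (bms - wms) / q, wms  # (var_between, var_error)
```

With unbalanced data the expected mean squares no longer take this simple form, which is precisely where the likelihood-based methods become attractive.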
In both the inter-rater reliability study and the laboratory precision experiment, and in many other experiments in which variance components are estimated, it is usually assumed that the within-rater or within-laboratory variances are the same for all raters or all laboratories, respectively. It is, however, possible to relax this assumption.97 Graphical methods of exploring within- and between-laboratory variation are discussed in the recent book by Mandel.105 Various methods of outlier detection and tests of variance homogeneity are also provided in the International Standard.32 Readers are referred to an interesting paper by Jaech106 in which laboratory precision experiments are considered as an example of a method comparison study.

5.3 Estimation of variance components: design

Readers who are specifically interested in laboratory precision experiments are referred to the International Standard32 or to the book by Caulcott and Boddy.33 General approaches to the study of multiple sources of variability are also covered in great detail elsewhere.95,99 Here we will concentrate on the design of relatively simple reliability studies. Consider an example in which one wishes to estimate variability between and within raters. Each member of a sample of n subjects (items) is rated once only by each member of a sample of N raters (methods). One might wish to consider raters as fixed (as in a method comparison study) but it is often of more interest to consider the raters as a random sample drawn from a potentially much larger population of possible raters. In this case we are concerned with the estimation of three variance components: one for subject effects, one for rater effects and one for the variation arising from residual 'error'. In this design (that is, without replication) we cannot distinguish rater-by-subject interaction effects from measurement error: they are confounded. We will assume that there are no rater-by-subject interactions.
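This confounding can be seen in a small simulation: when a genuine rater-by-subject interaction is present, the residual mean square of the unreplicated two-way layout estimates the sum of the interaction and error variances, not the error variance alone. The simulation below is purely illustrative (a variance of 1 for each source, chosen arbitrarily).

```python
import random

def residual_mean_square(x):
    """Residual mean square for a balanced two-way table without
    replication, based on x_ij - subject_mean_i - rater_mean_j + grand_mean."""
    n, N = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n * N)
    subj = [sum(row) / N for row in x]
    rater = [sum(x[i][j] for i in range(n)) / n for j in range(N)]
    ss_e = sum((x[i][j] - subj[i] - rater[j] + grand) ** 2
               for i in range(n) for j in range(N))
    return ss_e / ((n - 1) * (N - 1))

random.seed(1)
n, N = 2000, 3
subj = [random.gauss(0, 1) for _ in range(n)]
rater = [random.gauss(0, 1) for _ in range(N)]
# The interaction term (variance 1) and the pure error term (variance 1)
# enter each cell in exactly the same way, so the design cannot separate them:
x = [[subj[i] + rater[j] + random.gauss(0, 1) + random.gauss(0, 1)
      for j in range(N)] for i in range(n)]
# residual_mean_square(x) will be close to 1 + 1 = 2, not to 1
```

With replicated ratings per subject-rater cell, by contrast, a separate within-cell error term becomes available and the interaction component can be estimated.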

Now, if we are really interested in the estimation of the rater effect variance then it does not seem very sensible to restrict our study to, say, two particular raters. This is, however, by far the most common design used in practice. Sometimes one might have five or six raters, but very rarely over 10. The reasons for this are practical. In a chemical pathology laboratory the amount of material on which the measurements are to be made may be severely limited. In a psychiatric clinic a patient obviously cannot be interviewed repeatedly and may not tolerate the presence of more than one or two observers or raters (one of them assumed to be the interviewer). Similar conclusions would arise in the context of many direct physical examinations of the patients themselves. In the case of video or audio recordings of interviews, however, or of the examination and rating of X-ray or photographic images, these restrictions would not necessarily apply. It still might be too time-consuming or expensive to get all raters to rate all possible items, however.

Suppose that we have N raters who are available to take part in a reliability study and suppose that k (< N) is the number of raters who can feasibly rate any single item.
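One simple way to implement such a design, sketched below under the assumption that any balanced rotation of rater subsets is acceptable, is to cycle through all k-subsets of the rater pool so that each rater (and indeed each pair of raters) is used as evenly as the item count allows. The function name is illustrative, not taken from any of the sources cited here.

```python
from itertools import combinations, cycle, islice

def assign_raters(n_items, raters, k):
    """Assign a subset of k raters to each of n_items items by cycling
    through all k-subsets of the rater pool in a fixed order."""
    return list(islice(cycle(combinations(raters, k)), n_items))
```

With three raters, k = 2 and six items, for example, the three possible pairs are each used twice and every rater ends up rating four items.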
