The Structured Clinical Interview for DSM-III-R (SCID): II. Multisite Test-Retest Reliability

Janet B. W. Williams, DSW; Miriam Gibbon, MSW; Michael B. First, MD; Robert L. Spitzer, MD; Mark Davies, MPH; Jonathan Borus, MD; Mary J. Howes, PhD; John Kane, MD; Harrison G. Pope, Jr, MD; Bruce Rounsaville, MD; Hans-Ulrich Wittchen, PhD

• A test-retest reliability study of the Structured Clinical Interview for DSM-III-R was conducted on 592 subjects in four patient and two nonpatient sites in this country, as well as one patient site in Germany. For most of the major categories, κs for current and lifetime diagnoses in the patient samples were above .60, with an overall weighted κ of .61 for current and .68 for lifetime diagnoses. For the nonpatients, however, agreement was considerably lower, with a mean κ of .37 for current and .51 for lifetime diagnoses. These values for the patient and nonpatient samples are roughly comparable to those obtained with other structured diagnostic instruments. Sources of diagnostic disagreement, such as inadequate training of interviewers, information variance, and low base rates for many disorders, are discussed. (Arch Gen Psychiatry. 1992;49:630-636)

The development of the Structured Clinical Interview for DSM-III-R (SCID) is described in an accompanying report.1 That the SCID is in widespread use is attested to by the more than 100 published studies that have used the instrument to select or describe their study samples. In this report, we will describe the method and results of a multisite test-retest reliability study of the SCID.

In planning this study, a decision was made to include several sites that would represent a range of settings and include adult psychiatric patients and community subjects. In addition to selecting a wide range of sites and subjects to increase the generalizability of the results of this study, the rigorous test-retest method for testing reliability was chosen. Researchers regard this method as "state-of-the-art" reliability assessment. In this procedure, two clinicians independently interview and rate the same subject. The advantage of this method is that it approximates actual practice, in which different clinicians are likely, even with a semistructured interview, to elicit different responses from the same subject (information variance). Results obtained with this method are usually lower but more generalizable to the real world than those obtained using the "joint" method of reliability testing, in which two raters independently observe and rate the same interview.2

See also p 624.

Accepted for publication July 24, 1991. From the Department of Psychiatry, Columbia University, New York, NY, and the New York State Psychiatric Institute (Drs Williams, First, and Spitzer, Ms Gibbon, and Mr Davies); Harvard Medical School, Cambridge, Mass (Drs Borus and Howes); Department of Psychiatry, Hillside Hospital, Long Island Jewish Medical Center, Glen Oaks, NY (Dr Kane); Harvard Medical School, McLean Hospital, Belmont, Mass (Dr Pope); Department of Psychiatry, Yale University Medical School, Substance Abuse Treatment Unit, New Haven, Conn (Dr Rounsaville); and the Max Planck Institute for Psychiatry, Munich, Federal Republic of Germany (Dr Wittchen). Reprint requests to Biometrics Research Department, New York State Psychiatric Institute, 722 W 168th St, New York, NY 10032 (Dr Williams).

SUBJECTS AND METHODS

Description of Sites

Six sites in the United States were chosen to participate, including four psychiatric facilities, a health maintenance organization, and a site evaluating nonpatient subjects from the community. In addition, a seventh site in Germany participated. Each is described briefly below.

New York State Psychiatric Institute (NYSPI), in New York City, is a state-funded research and teaching hospital connected with the Department of Psychiatry at Columbia University. This institution provided inpatients and outpatients from a variety of clinical services, including a community catchment-area-based inpatient service, an outpatient depression evaluation service, specialized inpatient research units for eating disorders and depression, and a general outpatient psychiatric clinic.

Long Island Jewish-Hillside Medical Center, in an outer borough of New York City, is a private, nonprofit clinical training and research facility affiliated with the State University of New York at Stony Brook. This institution also provided patients from general inpatient units and outpatients from a mood disorder clinic and a clinic specializing in anxiety disorders.

McLean Hospital, outside of Boston, Mass, is a teaching and research hospital affiliated with Harvard Medical School. Inpatients came from general clinic evaluation units, and outpatients were recruited through an eating disorders research program and a general outpatient clinic.

The Substance Abuse Treatment Unit (SATU) in New Haven is part of the Connecticut Mental Health Center and is affiliated with the Yale University Department of Psychiatry. This facility provided inpatients and outpatients who were seeking treatment for abuse of alcohol and a variety of other drugs.

There were two nonpsychiatric-patient evaluation sites. One was the Harvard Community Health Plan, a health maintenance organization in suburban Boston that is associated with Harvard Medical School, and the other was the community surrounding the New York State Psychiatric Institute.

The Max Planck Institute for Psychiatry in Munich, Federal Republic of Germany, is a research institution. It has a 50-bed inpatient department treating mostly patients with mood and psychotic disorders.

Patient Selection

In the initial phase of the reliability study, research assistants at the patient sites recruited subjects from the most recent admissions to all participating services. From the beginning, an effort was also made to include any patient admitted with a "rare" diagnosis, for example, delusional disorder or a somatoform disorder. Toward the end of the study, only patients with rare diagnoses were included, to obtain large enough numbers in those diagnostic cells to assess reliability. Even with this effort, there were insufficient numbers for data analysis in several categories, such as somatization disorder and delusional disorder.

In ordinary clinical practice, prior to evaluating a patient, a clinician reads accompanying chart material and then, before formulating a final diagnosis, often interviews informants. In this study, a decision had to be made about how much of this kind of

Downloaded From: http://archpsyc.jamanetwork.com/ by a Monash University Library User on 09/01/2013

information would be available to the interviewers. In many previous test-retest reliability studies of diagnostic instruments, all of these materials were made available to the raters.2 A disadvantage of this method, however, is that the final diagnostic assessments are based not only on the results of the structured interview, but also on prior evaluations of variable thoroughness and on the care with which each of the raters reviewed the chart material. We chose to conduct the most rigorous test of the SCID by limiting the information available to the interviewers to a brief summary of the hospital admission evaluation, prepared by a research assistant.

Accordingly, after determining that a patient met the eligibility requirements for the study (ie, age 18 years or older, fluent in English [for United States sites], and unlikely to have an organic mental disorder), research assistants obtained informed consent and, for inpatients, completed a SCID Inpatient Admission Summary. This single-page form was intended to summarize the circumstances of hospital admission, the number of previous hospital admissions, the presenting problem, the history of present illness, and previous treatments. Research assistants were specifically instructed not to include diagnostic terms on this form. The SCID Inpatient Admission Summary was made available to both reliability interviewers prior to their evaluations of the patient. Its purpose was to provide interviewers with evidence of psychopathology that might not be reported by the patient during the interview. For example, a patient who was brought in by the police in handcuffs after shouting at passersby in the street might tell the interviewer only that he came into the hospital because he was very "nervous." The information on the SCID Inpatient Admission Summary, however, would enable the interviewer to confront the patient with the actual circumstances to probe for evidence of psychotic symptoms.

Nonpatient Selection—Harvard Community Health Plan Site

Medical patients at Harvard Community Health Plan represented recent users of nonmental health medical services. Within a few days of a routine medical visit, patients who met study eligibility requirements (specified above) were sent a letter requesting their participation in the study and asking them to complete a 28-item version of the General Health Questionnaire.3 This screen was used to stratify the sample selected for the reliability study into three groups: patients with high General Health Questionnaire scores, who would be likely to have a mental disorder; patients with low General Health Questionnaire scores and presumably no mental disorder; and an intermediate group.

Nonpatient Selection—NYSPI Site

This sample was recruited from people who responded to notices posted in the community and advertisements placed in community newspapers. The notices solicited volunteers who were "worried, depressed, nervous" or had "unexplained physical symptoms," but who had not received any kind of mental health treatment in the past 2 years. Potential volunteers were screened by telephone to make sure they met the eligibility requirements for the study, prior to being scheduled for the first interview. At both nonpatient sites, subjects were paid a nominal amount for their time after the second interview was completed. At all sites, patient and nonpatient, the second reliability interview was to be administered at least 24 hours after the first interview, but no longer than 2 weeks later.

PROJECT PHASES

The project was designed in four phases. Phase 1 began in May 1985, before the criteria for DSM-III-R were finalized. For this reason, the first 3 months of the project were devoted to updating the SCID questions (which had originally been developed for DSM-III) to correspond to the evolving DSM-III-R criteria. During this time, the SCID was continually tested in joint interviews by the senior project staff at NYSPI. Although the original timetable for DSM-III-R called for the criteria to be finalized by 1985, revisions of the criteria continued, in fact, until December 1986. Therefore, by necessity, revisions of the SCID were being made throughout the first 6 months of the reliability study.

The second phase of the project began in July 1985 with a training session for the senior staff from all of the sites in preparation for broader testing and refining of the SCID in these settings. During this additional pilot work, joint interviews conducted by the investigators were audiotaped, and feedback was given to the

NYSPI staff about needed revisions. These were discussed in monthly conference calls among the collaborating sites.

Phase 3 began with the selection and training of interviewers. All participating interviewers were recruited from within each site. All were mental health professionals, with a master's degree (n=5), a PhD (n=6), or an MD (n=14). Their backgrounds ranged from a 1-year internship in a PhD program to many years of research experience and clinical practice. A few had had extensive prior experience with structured diagnostic interviews, but most had not. Some initial training was done at each site by the senior staff prior to a 2-day plenary training session of interviewers from all of the sites. During this 2-day session, much time was spent discussing possible revisions in the SCID questions and details of the reliability protocol. Unfortunately, although these discussions were necessary, they took away from the time available for observation and supervision of the trainees' interviewing techniques. Following this training session, interviewers returned to their home sites to conduct a series of pilot interviews that were audiotaped and sent to NYSPI for feedback.

For the Munich site, the SCID was translated into German by one of us (H.U.W.) and his team and back-translated by two experienced bilingual research assistants, according to guidelines for translation of documents by the World Health Organization. Following the back-translation, inconsistencies and problems were resolved by the entire research team. Subsequent revisions of the SCID were not back-translated.

Phase 4 began in August 1986, when the actual test-retest interviews were underway in all of the sites. Except in those cases in which the subjects refused, all interviews were audiotaped for subsequent monitoring and post hoc review.
Within 3 days after completion of the second of each pair of independent assessments, the interviewers conferred to determine reasons for diagnostic disagreement and to arrive at consensus diagnoses. To document the reasons for diagnostic disagreements, a Reliability Assessment Report Form was completed by the interviewers for each case. The entire case, including the Reliability Assessment Report Form, was then sent to NYSPI for data editing. If the reasons for disagreement were unclear, the audiotapes were reviewed by NYSPI staff. This monitoring resulted in (1) feedback to the interviewers about errors in their use of the SCID and (2) identification of problems and inconsistencies in the SCID that led to changes in questions and format. In all, six different versions of the SCID were used in the test-retest interviews.

One of the criticisms of the DSM-III field trials was that colleagues working together in the same facilities were more likely to agree because they were more likely to have shared similar training, experience, and theoretical orientation toward diagnosis. To test this hypothesis, all of the sites, except for SATU, sent raters to other sites in the United States, where they were paired with interviewers from the host site.

RESULTS

Five hundred ninety-two subjects were interviewed; of these, 390 were patients and 202 were nonpatients. Sixteen of the 25 raters participated in the cross-site study, evaluating 98 cases that included both patients and nonpatients. Twenty-five of these cases were evaluated by the two German raters, paired with four raters at McLean. Demographic characteristics of the sites are presented in Table 1. (In all of the tables, cases with missing data are excluded from the calculations. Missing data never exceed 6% of the cases.) The gender ratios reflect the often-observed fact that more women than men seek mental health and primary care services. The exception to this is the SATU site, where there is an expected predominance of men, given the admission requirement of a substance use disorder. Most subjects across sites were white;


Table 1.—Demographic Characteristics of Subjects at Each Site*

[Rows: No. of subjects; mean age, y; gender, F, %; ethnicity, white, %; SES,† % (classes 1 and 2, 3, and 4 and 5); inpatients, %; and outpatients or day patients, %. Columns: the patient sites LIJH (n=100), NYSPI (n=56), McLean (n=100), SATU (n=50), and Munich (n=84), and the nonpatient sites NYSPI-NP (n=102) and HCHP (n=100). Beyond the numbers of subjects, the individual cell values are not recoverable from this scan.]

*LIJH indicates Long Island Jewish-Hillside Medical Center; NYSPI, New York State Psychiatric Institute; SATU, Substance Abuse Treatment Unit; NP, nonpatient; and HCHP, Harvard Community Health Plan.
†Hollingshead-Redlich classes. SES indicates socioeconomic status.

nonwhite subjects were primarily black, except for a substantial number of Hispanics at the NYSPI sites. Classes 1 and 2 are overrepresented in all of the sites except SATU, as is often the case in patients recruited at research and teaching facilities.

Table 2 presents the levels of agreement for each current and lifetime diagnosis in the patient samples. Table 3 presents the same information for the nonpatient samples. Current disorders are those that were present at any time during the month prior to the interview. Lifetime diagnoses are those that were present at any time in the subject's life and therefore include current disorders. Agreement between interviewers is expressed in terms of κ, a statistic that corrects for chance agreement.4 To avoid presenting κs that are likely to be unstable, κ is calculated only for those categories that were diagnosed a minimum of 10 times altogether by either rater (regardless of their agreement on individual cases). Therefore, some SCID categories that were rarely diagnosed are not presented in either table. Subjects may be counted in more than one diagnostic category if multiple diagnoses were given.

The overall weighted κ summarizes the κs for all diagnoses (including those categories that were diagnosed fewer than 10 times altogether by either rater) by weighting each diagnostic category by the total number of cases given that diagnosis by either rater. The numerator is the sum of the products of the diagnostic weight and κ for each diagnosis; the denominator is the sum of the weights. For individual diagnoses, diagnostic classes, or the overall weighted κ for the entire set of diagnoses, a high value (generally .7 and above) indicates good agreement; κs ranging from .5 to .7 generally indicate fair agreement, and below .5 is poor.

For most of the major categories (bipolar disorder, major depression, schizophrenia, alcohol abuse/dependence), κs for current and lifetime diagnoses in the patient samples were above .60, with a mean κ of .61 for current and .68 for lifetime diagnoses for the combined samples. For the nonpatients, however, agreement was often considerably lower, with a mean κ of .37 for current and .51 for lifetime diagnoses for the combined samples. These tables illustrate large differences in diagnostic agreement for individual diagnoses across the patient and the nonpatient sites. There are also considerable differences in some of the overall weighted κs for the different patient sites.

The SATU site, specializing in treatment of alcohol and drug dependence, provided enough patients to assess the reliability of specific drug dependence diagnoses (Table 4). The only two drug categories for which the κs were below .60 were cannabis dependence and polydrug dependence.

The results of the cross-site substudy (Table 5) indicated that when the American raters were paired with raters from another site, diagnostic agreement was not lower than when they were paired with raters from their own site. As might be expected, the κ for the German interviewers paired with American raters evaluating the conditions of American patients at the McLean site was lower than that between the German raters at their site evaluating the conditions of German patients. However, with these sample sizes, this difference was not statistically significant.

The Global Assessment of Functioning Scale, summarizing the subject's functioning for the worst week during the month prior to the interview, was also completed by each interviewer. Agreement (intraclass correlation coefficients) on the Global Assessment of Functioning Scale is presented in Table 6. With the exception of the Munich site, the agreement is high, even in the two nonpatient samples.

As mentioned above, we had intended to oversample for rare diagnoses. Although we were successful for some of these categories, for others, such as somatization disorder in both the patient and community samples and hypochondriasis and generalized anxiety disorder in the community samples, there were insufficient cases given the diagnosis to calculate κs.
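The κ computations described above can be sketched in a few lines of code. This is an illustrative reconstruction, not the study's analysis code: the function names and the toy 2x2 tables are assumptions, and only the formulas (Cohen's κ and the diagnosis-weighted average) come from the text.

```python
# Each diagnosis is summarized by a 2x2 agreement table between the two
# interviewers: a = both diagnose the disorder, b = only rater 1,
# c = only rater 2, d = neither.

def cohens_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = a + b + c + d
    p_obs = (a + d) / n                     # observed proportion agreeing
    p_yes = ((a + b) / n) * ((a + c) / n)   # chance agreement on "present"
    p_no = ((c + d) / n) * ((b + d) / n)    # chance agreement on "absent"
    p_chance = p_yes + p_no
    return (p_obs - p_chance) / (1 - p_chance)

def overall_weighted_kappa(tables: list) -> float:
    """Weight each diagnosis by the number of cases given that diagnosis
    by either rater (a + b + c), per the description in the text."""
    num = den = 0.0
    for a, b, c, d in tables:
        weight = a + b + c
        if weight == 0:
            continue  # category never diagnosed by either rater
        num += weight * cohens_kappa(a, b, c, d)
        den += weight
    return num / den

# With identical numbers of disagreements, kappa falls as the base rate
# falls, illustrating why agreement was lower in the nonpatient samples:
common = cohens_kappa(25, 5, 5, 65)  # base rate ~30%, kappa ~.76
rare = cohens_kappa(2, 5, 5, 88)     # base rate ~7%, kappa ~.23
```

The two toy tables at the end have the same 10 disagreements out of 100 cases; only the base rate differs, which is the point made about rare diagnoses in the Comment section.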

COMMENT

It is typically assumed that the reliability associated with a structured interview is an intrinsic property of the instrument itself (eg, "How reliable is the SCID?"). In fact, the reliability of any interviewer-administered instrument is a function of many factors, including the reliability of the diagnostic criteria themselves, the characteristics of the interviewers (eg, their training, motivation, experience, interpersonal style), the study method (eg, test-retest or joint interviews, sources of information available, live or videotaped interviews), and the characteristics of the subject sample (eg, base rates of diagnoses, frequency of subthreshold cases). For this reason, comparisons of κs from different reliability studies may be misleading and should be made with caution.

With this caveat, Tables 7 and 8 provide the κs obtained in several major reliability studies, including the present study of the SCID. In Table 7, the SCID-Patient Version (SCID-P) reliability is given, as well as that of the Schedule for Affective Disorders and Schizophrenia/Research Diagnostic Criteria (SADS/RDC),2 the Diagnostic Interview Schedule (DIS)5 patient sample, the DSM-III field trial,6 and the Composite International Diagnostic Interview,7 a modification of the DIS. Although the reliability studies reported in Table 7 differed greatly in many respects (diagnostic criteria, thoroughness of training of interviewers, timing of retest interviews, stability of instrument, etc), investigators may find these reliability data useful in helping them evaluate published reports of research already done. As can be seen, the SCID-P κs are similar to those obtained with the SADS/RDC, with the exception of current major depressive disorder and alcohol abuse/dependence.


Table 2.—Agreement (κ) on Current (Lifetime) Diagnoses in the Patient Samples, With Base Rates*

[Rows: bipolar disorder; major depression; dysthymia; schizophrenia; schizophreniform disorder; schizoaffective disorder; delusional disorder; brief reactive psychosis; alcohol abuse/dependence; other drug abuse/dependence; panic disorder; agoraphobia without panic disorder; social phobia; simple phobia; obsessive-compulsive disorder; generalized anxiety disorder; somatization disorder; hypochondriasis; anorexia nervosa; bulimia nervosa; adjustment disorder; and overall weighted κ. Columns: LIJH (n=100), NYSPI (n=56), McLean (n=100), SATU (n=50), Munich (n=84), and total sample (N=390); each cell gives the current κ and base rate, with lifetime values in parentheses. The individual cell values are too garbled in this scan to reconstruct.]

*LIJH indicates Long Island Jewish-Hillside Medical Center; NYSPI, New York State Psychiatric Institute; and SATU, Substance Abuse Treatment Unit.

We suspect that two factors contributed heavily to the better agreement obtained for current major depressive disorder in the study with the SADS. First, the SADS/RDC was developed for use in the Collaborative Study of the Psychobiology of Depression, and the test-retest interviewers were particularly focused on mood disorders and were well trained in the diagnosis of major depression. It is not surprising that they agreed on the diagnosis of current major depressive disorder more frequently than the generalist clinicians who participated in the SCID study. In addition, the RDC and DSM-III-R criteria for this category are different: while the RDC criteria for definite major depressive disorder require 2 weeks' duration, as do the DSM-III-R criteria, they do not require that each symptom be present "nearly every day," as is

[Table 2, continued: Munich (n=84) and total-sample (N=390) columns; most cell values are unrecoverable in this scan. The total-sample overall weighted κ was .61 (.68).]

required in DSM-III-R. Therefore, to agree on a diagnosis of major depression using the SCID/DSM-III-R, the pair of raters must both determine that each of the five required symptoms was present nearly every day for at least 2 weeks, a more stringent and complex criterion than in the SADS/RDC. The two studies also differed in their definition of "agreement," the SADS/RDC study including cases that met either "probable" or "definite" criteria. With the SCID/DSM-III-R, there is only one diagnostic threshold, corresponding to definite diagnoses. In addition, the SADS/RDC study was conducted in 1975, when the clinical picture was not complicated by the use of illicit drugs for as many psychiatric patients as was true in 1986, when the SCID reliability study was con-


Table 3.—Agreement (κ) on Current (Lifetime) Diagnoses in the Nonpatient Samples, With Base Rates*

[Rows: bipolar disorder; major depression; dysthymia; alcohol abuse/dependence; other drug abuse/dependence; panic disorder; agoraphobia without panic disorder; social phobia; simple phobia; obsessive-compulsive disorder; generalized anxiety disorder; somatization disorder; hypochondriasis; anorexia nervosa; bulimia nervosa; adjustment disorder; and overall weighted κ. Columns: NYSPI-NP (n=102), HCHP (n=100), and total sample (n=202). Most cell values are too garbled in this scan to reconstruct; the overall weighted κ was .38 (.50) for NYSPI-NP, .32 (.49) for HCHP, and .37 (.51) for the total sample.]

*NYSPI indicates New York State Psychiatric Institute; NP, nonpatient; and HCHP, Harvard Community Health Plan.

Table 4.—Agreement (κ) on Current (Lifetime) Drug Dependence Diagnoses at the SATU Site, With Base Rates*

[Rows: alcohol dependence; sedative dependence; cannabis dependence; stimulant dependence; opioid dependence; cocaine dependence; and polydrug dependence. The cell values are too garbled in this scan to reconstruct.]

*SATU indicates Substance Abuse Treatment Unit.

ducted. The subject selection procedure for the SADS/RDC study also allowed research assistants to exclude patients who used drugs heavily. In the SCID study, many of the subjects included had histories of heavy drug use; only those for whom the most likely current diagnosis was a psychoactive-substance-induced organic mental disorder (eg, cocaine-induced delusional disorder) were excluded.

The higher κs in Table 7 for SADS/RDC alcohol abuse and dependence may be due to differences between the RDC and DSM-III-R criteria. The RDC diagnoses are made by choosing any two of a list of 20 symptoms (eg, "others complain of his drinking," "frequent blackouts"). In contrast, the DSM-III-R criteria themselves are much more complex (eg, "continued substance use despite knowledge of having a persistent or recurrent social, psychological, or physical problem that is caused or exacerbated by the use of the substance") and require the rater to judge one of only two symptoms as present for abuse, and three of nine symptoms for dependence.
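The contrast between the RDC and DSM-III-R rules above is essentially a difference in polythetic thresholds ("any m of n symptoms"). A minimal sketch, with placeholder item names rather than the actual RDC or DSM-III-R criteria sets:

```python
# Toy illustration of polythetic ("m of n symptoms") diagnostic thresholds.
# The item names below are placeholders, not the real criteria.

def meets_threshold(symptoms_present: set, criterion_items: set,
                    minimum: int) -> bool:
    """True if at least `minimum` of the criterion items are present."""
    return len(symptoms_present & criterion_items) >= minimum

# RDC-style rule: any 2 of a long list (20 items for alcoholism in the RDC)
rdc_items = {f"rdc_item_{i}" for i in range(1, 21)}
# DSM-III-R-style dependence rule: 3 of 9 items
dsm_dependence_items = {f"dsm_item_{i}" for i in range(1, 10)}

patient = {"rdc_item_3", "rdc_item_11", "dsm_item_2"}
rdc_positive = meets_threshold(patient, rdc_items, 2)             # True
dsm_positive = meets_threshold(patient, dsm_dependence_items, 3)  # False
```

With a long item list and a low threshold, two raters can disagree on many individual items yet still agree on the diagnosis, one reason a simpler rule can yield higher κs.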

In the SADS/RDC reliability study, raters were encouraged to consult all sources of information, including talking

[Tables 3 and 4, continued: the remaining cell fragments are unrecoverable in this scan.]

to the patient's therapist, before making their final diagnostic

decisions. Two of us (M.G. and J.B.W.W.) were interviewers in the SADS/RDC study and have vivid memories of time spent reading carefully through voluminous charts describ¬ ing past hospitalizations. In contrast, the SCID interviews were

at most supplemented only by the brief hospital admis¬

sion summaries prepared from the admission notes by a research assistant. Occasionally, these notes included information about past episodes, but usually not. Therefore, the diagnostic decisions of the SCID interviewers were much more vulnerable to the subjects telling different stories to each interviewer (information variance). In fact, information variance was cited most frequently on the Reliability Assessment Report Form as the source of disagreement between raters in the SCID study. The DIS reliability figures were obtained by comparing two interviews done by lay interviewers, one using the standard DIS and the other using a computer-prompted version of the DIS. This study was chosen for comparison purposes because it is the only test-retest reliability study of the DIS using lay interviewers for both interviews. As can be seen in Table 7, the reliability figures are comparable to those obtained in our SCID study, except for the categories of Bipolar Disorder and Schizophrenia, in which the DIS ks are considerably lower. It is surprising, in fact, that the reliability for the DIS is not higher. The DIS is a fully structured interview in which the major source of unreliability in test-retest interviews is information variance, in which a subject answers "yes" to one interviewer's question and "no" to the same question when asked by the other interviewer.8 In contrast, the SCID is a semistructured interview that allows for considerable variation in interviewing style, depth of probing, and clinical judgment as to whether a patient's description of a particular behavior meets the relevant diagnostic criterion.

Table 8 presents the levels of diagnostic agreement obtained in samples of subjects who are not identified as psychiatric patients. In the case of the SCID-Nonpatient Version, this includes community subjects who volunteered in response to an advertisement for people who had symptoms but no recent treatment, and primary care patients at a health maintenance organization. The DIS sample consisted of community subjects who were selected for the retest interview because they had been given a diagnosis by the first ("test") interviewer, as well as a subsample with no diagnosis. The retest interviews were all administered by physicians using the DIS. The values are considerably higher for the SCID than for the DIS for all of the diagnoses noted in Table 8. In our study, as might be expected, agreement for every
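The agreement statistic used throughout this report is Cohen's kappa, which corrects observed test-retest agreement for the agreement expected by chance. A minimal sketch of the calculation follows; the subject data are invented for illustration and are not taken from the study.

```python
# Illustrative sketch only -- the ratings below are invented,
# not taken from the SCID reliability study.
def cohens_kappa(rater1, rater2):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal rate of "present" (1).
    p1 = sum(rater1) / n
    p2 = sum(rater2) / n
    chance = p1 * p2 + (1 - p1) * (1 - p2)
    return (observed - chance) / (1 - chance)

# Ten subjects; 1 = diagnosis present, 0 = absent.
test_ratings   = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
retest_ratings = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(round(cohens_kappa(test_ratings, retest_ratings), 2))  # -> 0.78
```

Raw agreement here is 90%, yet kappa is only .78, because the two raters would already agree often by chance on the common "absent" rating; this chance correction is why kappa is the preferred index of diagnostic reliability.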

[Table: overall weighted k for current diagnoses (lifetime values in parentheses), by rater pairing and by site. Cross-site vs on-site pairings: all American raters, cross-site (n=73) and on-site (n=410); with a German rater, cross-site (n=25); all German raters, on-site (n=84); all raters, cross-site (n=98) and on-site (n=494). The recoverable values are .57 (.67), .55 (.66), .57 (.63), .62 (.66), .59 (.64), and .47 (.57); their cell assignments could not be recovered from this copy. Per-site ks: LIJH (n=100), .69; NYSPI-P (n=156), .82; McLean (n=100), .62; SATU (n=50), .78; Munich (n=84), .47; NYSPI-NP (n=102), .75; HCHP (n=100), .69.]

*LIJH indicates Long Island Jewish-Hillside Medical Center; NYSPI, New York State Psychiatric Institute; P, patient; SATU, Substance Abuse Treatment Unit; NP, nonpatient; and HCHP, Harvard Community Health Plan.
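The overall weighted ks above summarize many per-diagnosis ks in a single number. The article does not restate the weighting scheme at this point, but one common convention, weighting each diagnosis's kappa by how often that diagnosis was assigned, can be sketched as follows (all numbers below are hypothetical, chosen only to show the arithmetic):

```python
# Sketch of an assumed weighting scheme -- the per-diagnosis kappas and
# counts below are hypothetical, not values from the study.
def overall_weighted_kappa(kappas, weights):
    """Average per-diagnosis kappas, weighted by diagnosis frequency."""
    total = sum(weights)
    return sum(k * w for k, w in zip(kappas, weights)) / total

kappas  = [0.84, 0.64, 0.65, 0.75, 0.84]  # hypothetical per-diagnosis kappas
weights = [40, 120, 60, 90, 30]           # hypothetical counts of subjects with each diagnosis
print(round(overall_weighted_kappa(kappas, weights), 2))  # -> 0.71
```

Weighting by frequency keeps a rare diagnosis with an unstable kappa from dominating the summary figure.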

[Table 7.—Test-retest reliability (k) in psychiatric patient samples, by instrument*: columns are SCID-P (DSM-III-R; n=390), SADS (RDC†; n=60), DIS (DSM-III‡; n=76), unstructured clinical interview (DSM-III§; n=131), and CIDI (DSM-III; n=60); rows give lifetime and current ks for bipolar disorder, major depression, schizophrenia, alcohol abuse and dependence, drug abuse and dependence, and panic disorder. The individual cell values were scrambled in this copy and are not reproduced here.]

*SCID indicates Structured Clinical Interview for DSM-III-R; P, patient; SADS, Schedule for Affective Disorders and Schizophrenia; RDC, Research Diagnostic Criteria; DIS, Diagnostic Interview Schedule; and CIDI, Composite International Diagnostic Interview. †Probable plus definite. ‡Lay vs computer-prompted interviewers. §DSM-III field trial, phase 1 only; phase 2 data not available. ||Probable plus definite are roughly equivalent to abuse and dependence in the RDC.

diagnosis in nonpatients was lower than that for patients. This is in part due to the fact that the base rate for all of the diagnoses in the nonpatients was considerably lower than for the patients. For example, the base rate for current major depression in the patients was 31%; for the nonpatients it was 4%. As noted by Shrout et al,9 if the error of measurement for a diagnostic instrument is constant, reliability varies directly with the base rate (ie, it is harder to obtain good reliability for a rare diagnosis than for a common diagnosis).

Prior to the study, we had expected higher reliability values. In particular, we had expected that they would be considerably higher than those obtained in the DSM-III field trials that did not employ structured interviews. We are at a loss to explain why this was not the case. We doubt that the DSM-III criteria were clearer than the DSM-III-R criteria. Critics of the DSM-III field trial methodology have suggested that many of the test-retest ratings may not have been arrived at completely independently.10 As we have discussed elsewhere, we think this is unlikely.11 It is of interest that the International Classification of Diseases, 10th Revision (ICD-10), field trials of draft criteria for ICD-10, which employed a similar methodology, yielded results comparable to the DSM-III field trials (Darrel A. Regier, MD, oral and written communication, May 22, 1991). Perhaps clinicians with an already-high level of expertise and commitment to using diagnostic criteria do not improve their reliability by using a structured interview that has the flexibility of the SCID. In fact, we know of no study that, across a large sample of patients, directly compared the reliability of diagnostic criteria with and without a structured interview.

Several features of this study may have compromised the diagnostic reliability that was obtained. These included insufficient training of raters and changes made in the instrument as the study progressed. We therefore suspected that reliability would be better with the later versions, which included some revisions in questions and items. However, the ks for the first half of the interviews did not differ from those for the second half.

Perhaps the most important feature of this study that may have lowered diagnostic reliability is the absence of a focus on a particular diagnostic group. In most research studies,
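The base-rate effect noted by Shrout et al can be made concrete. Hold each interview's error of measurement fixed (the sensitivity of .90 and specificity of .95 below are assumptions chosen only for illustration, not figures from the study) and the expected kappa between two independent interviews falls sharply as the disorder becomes rarer:

```python
# Sketch under assumed error rates (sensitivity .90, specificity .95);
# these are illustrative values, not estimates from the SCID study.
def expected_kappa(base_rate, sens=0.90, spec=0.95):
    """Expected kappa between two independent interviews with fixed error."""
    p = base_rate
    pos = p * sens + (1 - p) * (1 - spec)  # marginal rate of rating "present"
    # Expected observed agreement: both "present" plus both "absent".
    po = (p * sens**2 + (1 - p) * (1 - spec)**2
          + p * (1 - sens)**2 + (1 - p) * spec**2)
    pe = pos**2 + (1 - pos)**2             # chance agreement
    return (po - pe) / (1 - pe)

# Base rates for current major depression: 31% (patients) vs 4% (nonpatients).
for rate in (0.31, 0.04):
    print(f"{rate:.2f} -> kappa {expected_kappa(rate):.2f}")
```

With these assumed error rates, the expected kappa drops from about .72 at the patients' 31% base rate to about .36 at the nonpatients' 4% base rate, even though the instrument itself is unchanged; this mirrors the patient-nonpatient gap observed in the study.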


subjects are prescreened for likely diagnoses so that the base rates in the study sample are high for the diagnoses of interest, and the interviewers are very familiar with those diagnostic categories. For example, in a study of anxiety disorders using a version of the SADS that focuses on anxiety disorders (SADS-LA), high ks were obtained12 for some of the anxiety disorders. Subjects in this study had been screened for likely anxiety disorder diagnoses by a research assistant in a telephone interview. A reliability study conducted in such a setting is likely to yield higher levels of agreement than a reliability study conducted in an unselected, diagnostically heterogeneous sample because of the higher base rates for individual diagnoses and the specialized training of the interviewers in evaluating these disorders. In the one site in our study, SATU, in which the subjects were enrolled because of a likely diagnosis of substance dependence, the ks on individual drug use disorders were higher (Table 4). In contrast, in the other sites in our study, there was no prescreening and the interviewers did not focus on a small number of diagnoses.

Two studies illustrate that reliability is likely to be considerably higher with the SCID when the focus is on one or two disorders, rather than on all of the SCID diagnosable disorders, as was the case in this field trial. Riskind et al,13 using videotapes of 75 SCID interviews, obtained ks of .72 and .79 for generalized anxiety disorder and current major depression. Williams et al,14 in test-retest interviews of 72 patients at 13 international sites of the Upjohn Cross-National Collaborative Panic Study, obtained a k of .87 for the diagnosis of current panic disorder.

A feature of the study that may have artificially raised diagnostic reliability is the discussion of each case after the retest. However, the fact that diagnostic reliability did not improve over the course of the study makes this unlikely. (This technique of discussing sources of disagreement after a case is completed was also used in the SADS/RDC study.2)

We had hoped that the structure of the SCID would enable experienced clinicians with minimal training to use it reliably. It may be, however, that the flexible nature of the SCID and its heavy reliance on clinical judgment, features that make the instrument so user friendly, set an upper limit on its reliability. Maximizing reliability with the SCID clearly requires extensive training in the intent of the various diagnostic criteria and insistence that interviewers elicit descriptions of behavior to justify each criterion coded as "present." Specifically, a crucial component of training is to teach interviewers not to take shortcuts and rate criteria on the basis of clinical hunches without adequate data. In fact, a review of a sample of more than 50 of the audiotapes revealed that the information variance cited as the reason for diagnostic disagreement was almost always the result of one interviewer's acceptance of a yes response, while the other interviewer asked enough follow-up questions to determine that an initial yes did not meet the criterion. This kind of unreliability can presumably be improved with training.

Table 8.—Test-Retest Reliability (k) in Nonpatient Samples*

Diagnosis                            SCID-NP (N=202)    DIS (Version 3) (N=370)
L/T bipolar disorder                       ...                  .25
Major depression
  Lifetime                                 .49                  .33
  Current                                  .42                  ...
L/T alcohol abuse and dependence           .76                  .68
L/T drug abuse and dependence              .85                  .70
L/T panic disorder                         .65                  .28

*SCID indicates Structured Clinical Interview for DSM-III-R; NP, nonpatient; DIS, Diagnostic Interview Schedule; and L/T, lifetime.

CONCLUSION

The SCID was designed to be an efficient, user-friendly, and reliable clinical interview guide for making DSM-III-R diagnoses. That it is efficient and user friendly appears to be affirmed by the large number of studies that have used and are using the SCID, even before it has been formally introduced in the literature. The reliability of the versions of the SCID that were used in this study is roughly similar, across categories, to that obtained with other major diagnostic instruments, particularly when allowance is made for the differing methodologies in the various studies of the reliability of these instruments. Because the SCID relies on clinical judgment, its reliability is highly dependent on the training and skills of the interviewer, as well as the base rates of the disorders in the samples being studied. Therefore, investigators using the SCID should pay careful attention to adequate training of their interviewers and should conduct studies to determine the reliability of their interviewers in their particular setting.

The following colleagues assisted in this study: Robert Aranow, MD; Ronald Winchel, MD; Janet Lavelle, MSW; James I. Hudson, MD (site co-director); Thomas Kosten, MD (site co-director); Delbert Robinson, MD; Susan Insali, MSW; Michael Zaudig, MD (site co-director); Paul E. Keck, MD; Susan L. McElroy, MD; Richard Rumler, MD; Steven Wager, MD; Cynthia Berry-Morgan, MA; Ronnie Barnes, MA; Ruth Rosenberg, PhD; Niel Devin, PhD; and Cheryl Cohen.

References

1. Spitzer RL, Williams JBW, Gibbon M, First MB. The Structured Clinical Interview for DSM-III-R, I: history, rationale and description. Arch Gen

Psychiatry. 1992;49:624-629.
2. Spitzer RL, Endicott J, Robins E. Research Diagnostic Criteria. Arch Gen Psychiatry. 1978;35:773-782.
3. Goldberg DP, Hillier VF. A scaled version of the General Health Questionnaire. Psychol Med. 1979;9:139-145.
4. Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: John Wiley & Sons Inc; 1981.
5. Greist JH, Klein MH, Erdman HP, Bires JK, Bass SM, Machtinger PE, Kresge DG. Comparison of computer- and interviewer-administered versions of the Diagnostic Interview Schedule. Hosp Community Psychiatry. 1987;38:1304-1311.
6. Williams JBW, Spitzer RL. DSM-III field trials: interrater reliability and list of project staff and participants. In: Diagnostic and Statistical Manual of Mental Disorders. 3rd ed. Washington, DC: American Psychiatric Press Inc; 1980.
7. Semler G, Wittchen HU, Joschke K, Zaudig M, von Geiso T, Kaiser S, von Cranach M, Pfister H. Test-retest reliability of a standardized psychiatric interview (DIS/CIDI). Eur Arch Psychiatry Neurol Sci. 1987;236:214-222.
8. Robins LN, Helzer JE, Croughan J. National Institute of Mental Health Diagnostic Interview Schedule: its history, characteristics, and validity. Arch Gen Psychiatry. 1981;38:381-389.
9. Shrout PE, Spitzer RL, Fleiss JL. Quantification of agreement in psychiatric diagnosis revisited. Arch Gen Psychiatry. 1987;44:172-177.
10. Kutchins H, Kirk SA. The reliability of DSM-III: a critical review. Soc Work Res Abstracts. 1986;Winter:3-12.
11. Williams JBW, Spitzer RL. DSM-III and DSM-III-R: a response. Harvard Med School Mental Health Letter. 1988;5:3-5.
12. Mannuzza S, Fyer AJ, Martin LY, Gallops MS, Endicott J, Gorman J, Liebowitz MR, Klein DF. Reliability of anxiety assessment. Arch Gen Psychiatry. 1989;46:1093-1101.
13. Riskind JH, Beck AT, Berchick RJ, Brown G, Steer RA. Reliability of DSM-III diagnoses for major depression and generalized anxiety disorder using the Structured Clinical Interview for DSM-III. Arch Gen Psychiatry. 1987;44:817-820.
14. Williams JB, Spitzer RL, Gibbon M. International reliability of a diagnostic intake procedure for panic disorder. Am J Psychiatry. 1992;149:560-562.
