Original Articles

Agreement Among Multiple Observers on Endoscopic Diagnosis of Esophageal Varices Before Bleeding

FLEMMING BENDTSEN, LENE THEIL SKOVGAARD, THORKILD I.A. SØRENSEN AND PETER MATZEN

Departments of Medical Hepatology and Gastroenterology, Hvidovre Hospital, and Statistical Research Unit, University of Copenhagen, Denmark

The interobserver variation in diagnosis and grading of esophageal varices may be ascribed to characteristics of the observers as well as of the patients. Assessment of this variation therefore requires the contributions of multiple observers and patients. Twenty-eight patients with cirrhosis without previous bleeding or known presence of varices were subjected to upper gastrointestinal endoscopy. Each endoscopy was videotaped and shown to 22 endoscopists. The varices were graded on a scale of 0 to 3 according to size. Each endoscopist diagnosed varices in 8 to 20 patients (mean = 15.9). Overall agreement on the presence (grades 1 to 3) or absence (grade 0) of varices was 70%. The average κ value was 0.38 (standard deviation = 0.16). Discrimination between varices graded 0 to 1 and varices graded 2 to 3 gave a higher κ value (p < 0.01) of 0.52 (standard deviation = 0.17). There was a large variation in κ values (range = -0.025 to 0.975). No significant correlation was observed between κ values for the two dichotomies (r = 0.16). The κ values were not related to the experience of the endoscopist. Considerable variation in the agreement on diagnosis and grading of esophageal varices was found. These results must be taken into account in the assessment of trials of prophylaxis of first-time variceal bleeding. (HEPATOLOGY 1990;11:341-347.)

Received April 5, 1989; accepted September 6, 1989. This study was supported by grant 12-6322 from the Danish Medical Research Council. Address reprint requests to: Flemming Bendtsen, M.D., Department of Medical Hepatology, Hvidovre Hospital, DK-2650 Hvidovre, Denmark. 31/1/18059

The severity of first-time variceal bleeding and the poor results of its management have led to great interest in prophylaxis (1). Diagnosis of esophageal varices is usually more difficult before than after bleeding because the varices tend to be smaller and the endoscopist is not guided by bleeding. The possible side effects and complications of prophylactic treatment increase the demand for reliability of diagnostic procedures. Assessment of variceal presence and size, which may have prognostic value, is important (2, 3). Earlier studies have shown a large interindividual variation in the detection of esophageal varices by X-ray films (4) or rigid endoscopy (5). A few recent studies using flexible endoscopes (3, 6, 7) have dealt with the observer variation. All published studies (3, 6) have been carried out in patients selected on the basis of presence of varices on a previous endoscopy or previous bleeding episodes believed to be of variceal origin. Because the observer variation may depend on the skill of the observer, studies including few observers may be subject to sampling error.

Our purpose was to investigate the observer variation in a large number of endoscopists in the assessment of presence and size of esophageal varices. We used flexible endoscopy in patients who were candidates for prophylactic treatment of varices.

PATIENTS AND METHODS

Study Design. Upper gastrointestinal endoscopy was performed on 28 patients with cirrhosis. None of the patients had experienced upper gastrointestinal bleeding episodes or previously been subjected to endoscopy. Each investigation was videotaped; the 28 tapes were combined in sequence on one film. Twenty-two endoscopists saw the videotape and recorded the sizes of any varices they saw for each of the 28 endoscopies.

Patients and Endoscopy. The patients were taken consecutively from our outpatient clinic or from the ward. Patients with varices were invited to participate in a prophylactic trial against variceal bleeding. The 25 men and 3 women had a median age of 54 yr (range = 37 to 73 yr). Diagnosis of cirrhosis was proven by biopsy. Twenty-five patients had alcoholic cirrhosis, two had autoimmune cirrhosis and one had posthepatitic cirrhosis. Fourteen patients were classified in Child-Turcotte class A, 10 in class B and 4 in class C (8). Patients consented to the investigation after receiving thorough oral and written explanation. The study was approved by the Ethics Committee for Medical Research in Copenhagen.
The patients were lightly sedated with diazepam, 5 to 10 mg (KabiVitrum AB, Stockholm, Sweden) intravenously, before the endoscopy. The procedure was carried out using an endoscope with 30° oblique viewing optics (Olympus GIF K2, Olympus Optical Co. Ltd., Tokyo, Japan) by an expert who had performed more than 5,000 upper gastrointestinal endoscopies (P.M.). Another endoscopist (F.B.), who had performed more than 2,000 upper gastrointestinal endoscopies, followed the procedure on a video screen. Each endoscopy continued until the endoscopists agreed on its result. Special care was taken to keep the esophagus fully distended for a long period. A videotape was recorded from each of the 28 upper gastrointestinal endoscopies performed. The two endoscopists selected the best 2 min of each recording. These segments were then transferred to one videotape with an interval of 30 sec between each segment.

TABLE 1. Endoscopy experience of 22 endoscopists

                                                Endoscopies performed by each of the endoscopists
                                                ...      ...      ...      >1,000    Total
Endoscopists                                    3        7        3        9         22
Endoscopists who had performed sclerotherapy    2        3        2        8         15

TABLE 2. Grading of varices by pairs of 22 endoscopists evaluating each of 28 patients

Grading of variceal score      Grading by the other observer (%)
by one observer                0        1        2        3        Total
0 (n* = 3,777)                 47.9     43.9     7.5      0.7      100
1 (n = 3,341)                  49.6     25.7     23.5     1.1      100
2 (n = 2,004)                  14.1     39.2     33.1     13.6     100
3 (n = 410)                    6.6      9.3      66.3     17.8     100

*n = number of observer pairs.

Definition and Characterization of Varices. Esophageal varices were defined as longitudinal intumescences in the lumen of the esophagus. The varices were distinguished from mucosal folds on the basis of the following characteristics: (a) varices are visible when the esophagus is fully distended, (b) they have variable caliber, (c) they may be anastomosing, (d) they have an irregular course in the esophagus and (e) they reach the cardia. Esophageal varices were divided into three sizes based on the maximal degree of protrusion into the esophageal lumen. Small varices were described as those protruding by less than half their diameter into the lumen (grade 1). Medium-sized varices were said to protrude by approximately half their diameter into the esophageal lumen (grade 2). Large varices were those that protruded by more than half their diameter (grade 3).

The Observer Session. Twenty-two skilled endoscopists and a few trainees from hospitals participating in a prophylactic trial were invited to participate in the observer study. An instructional video defining esophageal varices and their characteristics and showing esophagoscopies of other patients with these characteristics was shown to the endoscopists. Before viewing the film, the endoscopists were asked to fill out questionnaires asking the approximate number of gastrointestinal endoscopies they had performed (Table 1), their experience with sclerotherapy of esophageal varices and their specialist degrees. The film with the study series of 28 esophagoscopies was then shown.
After each segment, the participants were asked to record their grading of the varices on the questionnaires, which included illustrations of the three grades of varices. Each endoscopist was unaware of the judgments made by the other endoscopists. The endoscopists were also asked about their opinion of the quality of the film (excellent, good, acceptable or poor).

Data Analysis. For the evaluation of agreement, two dichotomies of the scores were used in the analysis. These were grade 0 vs. grades 1 to 3 and grades 0 to 1 vs. grades 2 and 3. Because only 3.7% of the evaluations were grades of 3, no attempt was made to distinguish between grades 2 and 3. Neither did we attempt to use the score values as ordered categories to construct (weighted) κ values, because this would demand assessment of the relative seriousness of disagreements between grade 0 and grade 1 and between grade 1 and grades 2 and 3. The scores of the single observers were compared pairwise with those provided by every other single observer. They were also compared with the scores of the experts who performed the endoscopies. All agreements were expressed using unweighted κ statistics (9). κ is constructed to be zero when the agreement obtained can be entirely attributed to chance, and it attains a maximal value of 1.0 only in the case of complete agreement. A negative value of κ indicates an observed agreement less than that expected by chance (excess disagreement). Values of 0.75 to 1.00 are normally considered to be excellent, values between 0.4 and 0.75 fair to good, and those less than 0.4 rather poor (10). One should be careful with these interpretations, however, because the value of κ has been shown to be population-dependent (11). Comparisons between κ values are only meaningful when the underlying populations are similar, such as in random samples from the recruitment population. The statistical analysis of the computed κ values is described in the Appendix.
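The pairwise comparison described above can be illustrated with a minimal Python sketch. It assumes the usual definition of Cohen's unweighted κ (9), κ = (p_o - p_e)/(1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from each observer's marginal frequencies. The function names, the `cutoff` parameter and the three-observer data set are hypothetical and serve only as illustration; they are not the software or the data used in the study.

```python
from itertools import combinations
from statistics import mean, stdev


def dichotomize(grades, cutoff):
    """Map variceal grades (0 to 3) to booleans: True when grade >= cutoff."""
    return [g >= cutoff for g in grades]


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length lists of binary ratings."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa = sum(a) / n  # proportion rated positive by the first observer
    pb = sum(b) / n  # proportion rated positive by the second observer
    p_exp = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    if p_exp == 1.0:
        return float("nan")  # undefined: both observers used one category only
    return (p_obs - p_exp) / (1 - p_exp)


def pairwise_kappas(scores, cutoff):
    """Kappa for every unordered pair of observers.

    scores: dict mapping observer name -> list of grades, one per patient.
    """
    return {
        (i, j): cohen_kappa(dichotomize(scores[i], cutoff),
                            dichotomize(scores[j], cutoff))
        for i, j in combinations(sorted(scores), 2)
    }


# Hypothetical grades for 3 observers and 8 patients (NOT the study data).
scores = {
    "obs_A": [0, 1, 2, 0, 3, 1, 0, 2],
    "obs_B": [0, 0, 2, 0, 2, 1, 1, 2],
    "obs_C": [1, 1, 2, 0, 3, 0, 0, 1],
}

presence = pairwise_kappas(scores, cutoff=1)  # grade 0 vs. grades 1 to 3
size = pairwise_kappas(scores, cutoff=2)      # grades 0 to 1 vs. grades 2 to 3

for label, kappas in (("0 vs. 1-3", presence), ("0-1 vs. 2-3", size)):
    values = list(kappas.values())
    print(label, "mean kappa:", round(mean(values), 2),
          "S.D.:", round(stdev(values), 2))
```

With 22 observers the same loop yields 22 × 21/2 = 231 pairwise κ values for each dichotomy, which is the overall number of observer pairs reported in Table 3.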

RESULTS

Fourteen percent of the endoscopists called the quality of the film excellent and the remaining 86% good. Twelve of the 22 endoscopists had performed more than 600 endoscopies (Table 1) and most had experience with sclerotherapy. Seven were specialists in gastroenterology, hepatology or both. Six of these seven had performed sclerotherapy.

Figure 1A shows the distribution of variceal grading for each of the 28 patients evaluated by the 22 endoscopists. In 5 of the 28 patients, all the endoscopists agreed on the presence of varices, although some controversy existed about their grading. In another 6 patients almost all observers (at least 18 of 22) agreed on the absence of varices. In the remaining 17 patients, the number of endoscopists finding varices ranged from 8 to 21. Considering all pairs of observations of all patients, we found overall agreement on the presence of varices (grade 0 vs. grades 1 to 3) of 70%. Agreement in endoscopist pairs on the absence of varices took place in 47.9% of cases (Table 2). Only 8.2% of the partners of endoscopists grading varices as 0 described the same varices as grade 2 or 3. However, the partners of endoscopists describing varices as grade 3 called the same varices grade 2 or 3 in 84.1% of cases. In only 6.6% did the other endoscopists describe the absence of varices.

The frequency with which the endoscopists considered varices to be present also showed large variability (Fig. 1B). Positive findings were noted in 8 to 20 patients (15.9 patients, average) as compared with 14 patients characterized by the experts to have varices. The greatest variation was found in diagnosing grade 1 varices, which ranged from 4 to 15 patients for the 22 endoscopists, indicating that the endoscopists had different standards for diagnosing esophageal varices.

Figure 2A illustrates the distribution of κ values for pairwise agreement between the 22 endoscopists when discriminating between presence and absence of varices (grade 0 vs. grades 1 to 3). The average κ value was 0.38, but a large variation was present, with κ values ranging from -0.025 to 0.775. The average of the κ values of agreement between the single observer and the expert key was 0.49.

When discriminating between grades 0 to 1 and 2 to 3, κ values were significantly higher (p < 0.01) (Fig. 2B), with an average of 0.52 (range = 0.025 to 0.975). The average κ value for agreement with the expert key was 0.50. The average κ values for the eight endoscopists who had performed more than 1,000 endoscopies and had experience with sclerotherapy were higher, but not significantly so, for discrimination between presence or absence of varices (average = 0.43, range = 0.27 to 0.64). No difference was observed for varices graded 0 to 1 vs. 2 to 3 (average = 0.52, range = 0.28 to 0.79). κ values for internal agreement between the two experts were 0.64 for grade 0 vs. grades 1 to 3 and 1.00 for grades 0 to 1 vs. 2 to 3. The mean grading of the presence of varices in each of the 28 patients compared with the variceal grading according to the experts gave a κ value of 0.66.

While comparing pairwise κ values for the two diagnostic dichotomies, only an insignificant correlation was found (r = 0.16) (Fig. 3). Similarly, when comparing pairwise κ values relating to the expert key, no significant correlation was observed (Fig. 4). Table 3 shows that the interobserver variation among the experienced endoscopists was at the same level as that of the less practiced endoscopists. It appears that the specialists agreed more often than the nonspecialists, but the differences were not significant.

TABLE 3. Mean and S.D. of κ values for pairs of observers and of observers and experts grouped according to experience and specialist degree

                                         Variceal scores by grade
Characteristics of                       0 vs. 1-3          0-1 vs. 2-3
pairs of endoscopists        N*          Mean     S.D.      Mean     S.D.
Experienced
  Yes vs. yes                66          0.42     0.14      0.55     0.16
  Yes vs. no                 120         0.35     0.16      0.50     0.17
  No vs. no                  45          0.41     0.17      0.54     0.17
  Yes vs. experts            12          0.53     0.10      0.49     0.13
  No vs. experts             10          0.44     0.14      0.57     0.09
Specialist
  Yes vs. yes                21          0.45     0.14      0.56     0.15
  Yes vs. no                 105         0.37     0.17      0.50     0.17
  No vs. no                  105         0.38     0.15      0.53     0.17
  Yes vs. experts            7           0.53     0.12      0.41     0.11
  No vs. experts             15          0.47     0.13      0.58     0.08
Overall                      231         0.38     0.16      0.52     0.17

*N = number of pairs of observers in each group.

FIG. 1. (A) Distribution of variceal score among 28 patients sorted according to total score. Score values for expert endoscopists are also shown. (B) Distribution of variceal score among 22 observers, sorted according to total score. Score values for expert endoscopists are also shown. [Panel A: horizontal axis, patients; legend, variceal score 0 to 3. Panel B: vertical axis, no. of patients; horizontal axis, observers and experts.]

FIG. 2. (A) Frequency distribution of κ values for all combinations of pairs of observers for discrimination between patients with varices (grades 1 to 3) and without (grade 0). (B) Frequency distribution of κ values for all combinations of pairs of observers between patients with medium-sized to large-sized varices or small or no varices. The line on each figure is frequency of κ values smoothed with a kernel of order 2 and a bandwidth of 0.05. [Panel A: varices 0 vs. 1-2-3. Panel B: varices 0-1 vs. 2-3. Vertical axis, no. of pairs of observers; horizontal axis, kappa value.]

FIG. 3. κ values regarding presence of varices plotted against values for medium-sized to large-sized varices vs. small or no varices. [Vertical axis, kappa values for varices 0-1 vs. 2-3; horizontal axis, kappa values for varices 0 vs. 1-3.]

FIG. 4. κ values for two dichotomies compared with expert key.

DISCUSSION

The tool used for verification of esophageal varices in clinical trials has in the last decade almost uniformly been the flexible endoscope. The standard clinical method 20 to 30 yr ago was the barium esophagogram. Earlier studies have shown that radiologists disagreed on the interpretation of barium examinations of the esophagus in about one fourth of the patients (4). Other studies have shown that standard barium esophagograms probably underestimated the number of patients with varices and the sizes of the varices (12, 13). Using rigid endoscopes, two observers agreed in 67% of 39 patients investigated for the presence of varices (5). This is slightly less than the percentage found in our study, where overall agreement on the presence of varices was 70%.

Most published observer studies deal with only a few observers. It is difficult to apply conclusions of a study with few observers to other observers in general because the variation probably depends on the characteristics of each observer. Furthermore, the observers participating in the study should be comparable in experience and specialist degrees to the endoscopists who in clinical practice perform the diagnostic procedures.

It must be emphasized that our results may underestimate the clinically relevant κ values because the assessments were made from a videotape. In cases of doubt some observers might have benefited from performing the endoscopy themselves. A rather high κ value was found between the scoring given by the expert endoscopists and the mean score for each of the 28 patients evaluated by the 22 endoscopists (0.66). Difficulties in interpreting the film can therefore only represent a minor part of the observer variation.

In a recent Italian interobserver study, six endoscopists investigated 28 patients (6). Varix size was, as in our study, scored on three levels. This study dealt with scoring the sizes of varices, not with discrimination between presence and absence of varices. Patients were selected for the study on the basis of varices seen in previous endoscopies. The study revealed a κ value of 0.50 for size of varices, which is slightly above the value obtained in our study. The Italian study did not state how many patients were placed in each class of varices. It may be misleading to compare κ values for a diagnostic procedure if the populations from which the patients are selected differ because κ is population-dependent (11). In a French observer study (7) with endoscopists from different centers, a κ value for presence of varices was found to be 0.40, which is almost equal to our findings.

It was recently shown in a prospective study (3) and earlier, in a retrospective study (14), that other variceal appearances such as red wale markings have prognostic significance on the risk of variceal bleeding. κ values for endoscopical appearances were found to range from 0.52 to 0.95 (3).

We did not find any effect of endoscopical experience on the observer variation. This lack of effect might have been caused by the thorough instruction given to the endoscopists before they saw the film. Furthermore, most of the observers were quite experienced in endoscopy.

We found no correlation between the pairwise κ values for the two diagnostic dichotomies (Fig. 3). Although surprising, this is probably caused by differences in perception of the scoring scales rather than the standards for diagnosing varices (Fig. 1B).

In the last decade, prophylactic therapy for variceal bleeding has gained considerable interest (1). Prophylactic β blockade may be effective. Great controversy surrounds the efficacy of sclerotherapy (15). In the coming years, it can be expected that many patients without previous variceal bleeding will be evaluated for trials or given active treatment. It is still unclear what the selection criteria for patients entering prophylactic trials should be. Our study shows that great interobserver variation exists when endoscopists evaluate the size of varices. This finding implies that variceal size evaluated by endoscopy is problematic as a selection criterion for prophylactic trials. Many patients with medium-sized or large varices may not be included if the exclusion criterion is "small varices." On the other hand, our study shows that endoscopists are better at discriminating between small and medium-sized to large varices than between their presence or absence.
If patients with small varices are selected for trial, a significant number of the patients will probably be without varices. This might clutter the statistical analysis of the study. More important, some patients might be treated for a complication from which they do not suffer. The high rate of disagreement between observers in our study, despite the instructional film, demonstrates the importance of standardized observers participating in such trials. It seems reasonable to search for other diagnostic tools such as endoscopic ultrasound or Doppler flow to verify varices.

APPENDIX

The collection of pairwise κ values does not represent a random sample. Rather, the values are interdependent because they must obey consistency demands arising from the fact that every observer is compared with every other single observer. This interdependence should be kept in mind when interpreting κ values. The easiest way to do this is to pretend that we do not have as many observations as we actually have (since to some extent they express the same information). Just how much independent information is provided by the pairwise κ values is difficult to ascertain because this depends on the degree of consistency among the observers. However, a lower boundary corresponds to complete consistency (that is, observers only vary among themselves in their standards for varix detection), giving the degrees of freedom equal to the number of observers minus one. This lower boundary is used to construct conservative test statistics as described below.

The distributions of the pairwise κ values for the two dichotomies are shown as histograms in Figure 2. To give a better visual impression of these distributions, we smoothed the observed κ values using the kernel method (16), which may be described as a weighted moving average method. Although the two distributions of κ values are seen to deviate somewhat from normal distributions, they come reasonably close. Hence, for descriptive purposes we calculated averages and standard deviations.

When comparing pairwise κ values for different groups of observers (experts vs. nonexperts), we used a t test statistic. Because of the interdependencies described above, however, we evaluated the test statistic in a Student's distribution with a reduced number of degrees of freedom. Similarly, when comparing κ values for the two different dichotomies, we used a modified paired t test. For the significance testing of the traditional correlation coefficient, we also reduced the number of degrees of freedom (Fig. 3). However, when comparing κ values relating to the expert key for different groups of observers, we used a traditional nonparametric Mann-Whitney U test (fewer observations, no interdependencies).

REFERENCES

1. Burroughs AK, Heygere D, McIntyre N. Pitfalls in studies of prophylactic therapy for variceal bleeding in cirrhosis. HEPATOLOGY 1986;6:1407-1413.
2. Lebrec D, Fleury PD, Rueff B, Nahum H, Benhamou JP. Portal hypertension: size of esophageal varices and risk of gastrointestinal bleeding in alcoholic cirrhosis. Gastroenterology 1982;82:968-973.
3. The North Italian Endoscopic Club for the Study and Treatment of Oesophageal Varices. Prediction of the first variceal hemorrhage in patients with cirrhosis of the liver and oesophageal varices: a prospective multicenter study. N Engl J Med 1988;319:983-989.
4. Conn HO, Mitchell JR, Brodoff MG. Comparison of radiologic and esophagoscopic diagnosis of esophageal varices. N Engl J Med 1961;265:160-164.
5. Conn HO, Smith HW, Brodoff M. Observer variation in the endoscopic diagnosis of esophageal varices: a prospective investigation of the diagnostic validity of esophagoscopy. N Engl J Med 1965;272:830-834.
6. The Italian Liver Cirrhosis Project. Reliability of endoscopy in the assessment of variceal features. J Hepatol 1987;4:93-98.
7. Cales P, Buscail L, Bretagne JF, Champigneulle B, Bourbon P, Duclos B, Dapoigny M, et al. Inter centre observer agreement of gastroesophageal endoscopic signs in cirrhosis: a French experience. J Hepatol 1988;7(suppl 1):S109.
8. Child CG III, Turcotte JG. Surgery and portal hypertension. In: Child CG, ed. The liver and portal hypertension. Philadelphia: WB Saunders, 1964:50-56.
9. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37-46.
10. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-174.
11. Gjørup T, Jensen AM. The role of kappa coefficient in evaluating reproducibility of test results. Nord Med 1986;101:90-94.
12. Degradi AE, Skorneck AB, Stempien SL. The problem of diagnosis of oesophageal varices. Bull Gastrointest Endosc 1961;8:913.
13. Waldram R, Nunnerley H, Davis M, Laws JW, Williams R. Detection and grading of oesophageal varices by fibre-optic endoscopy and barium swallow with and without buscopan. Clin Radiol 1977;2:137-141.
14. Beppu K, Inokuchi K, Koyanagi N, Nakayama S, Sakata H, Kitano S, Kobayashi M. Prediction of variceal hemorrhage by esophageal endoscopy. Gastrointest Endosc 1981;27:213-218.
15. Pagliaro L, Burroughs A, Sørensen TIA, Lebrec D, Morabito A, D'Amigo G, Tine F. Therapeutic controversies and randomized controlled trials (RCTs): prevention of bleeding and rebleeding in cirrhosis. Gastroenterology International 1989;2:71-84.
16. Silverman BW. In: Cox DR, Hinkley DV, Rubin DE, Silverman BW, eds. Density estimation for statistics and data analysis. London: Chapman & Hall, 1986:42-43.
