THE LANCET

The Value of Diagnostic Tests

THE assessment of diagnostic methods belongs to the backwoods of clinical research. New treatments are submitted to thorough testing, and books are written on the methodology of the controlled therapeutic trial; but no standards have been fixed for the assessment of diagnostic methods. An improvement of this state of affairs will require an increased interest in a number of very different topics, including intra-observer and inter-observer variation, standardisation of terminology, diagnostic strategies, and the very concept of disease.

The first large studies of observer variation were done in the 1950s, and a few years ago KORAN¹ reviewed the published work. He stressed the importance of taking into account the expected chance agreement between observers, and an analysis of previous studies revealed that the agreement between, for instance, radiologists was often only half-way between chance agreement and total agreement. It would be a big step forward if new radiological, ultrasonic, and scintigraphic methods were always subjected to studies of the inter-observer variation of "blind" readings.

Terminological matters are always dull, but in this case they are important. Frequently, papers on diagnostic methods cite the false-positive or false-negative rate, but these terms are highly ambiguous. The false-positive rate may express the proportion of persons who do not have the disease among persons with a positive test result, or it may mean the proportion of positive test results among persons who do not have the disease. The confusion was clearly demonstrated when CASSCELLS and his co-workers² put the following question to doctors and medical students: "If a test to detect a disease, whose prevalence is 1/1000, has a false-positive rate of 5 per cent, what is the chance that a person found to have a positive result actually has the disease?" 27 of 60 participants answered "95%", obviously using the former of the two definitions of a false-positive rate, whereas 18 answered "approximately 2%", using the latter definition. (That answer is the result of so-called Bayesian reasoning:³,⁴ 1 of 1000 persons may be expected to have the disease, and that person will presumably give a true-positive test result. 999 will not have the disease, and 5% or roughly 50 of these will give a false-positive result. Consequently, only 1 of 51 persons with a positive result will have the disease.)

1. Koran, L. M. New Engl. J. Med. 1975, 293, 642, 695.
2. Casscells, W., Schoenberger, A., Graboys, T. B. ibid. 1979, 300, 999.
3. Wulff, H. R. Rational Diagnosis and Treatment. Oxford, 1976.
4. Galen, R. S., Gambino, S. R. Beyond Normality: The Predictive Value and Efficiency of Medical Diagnoses. New York, 1975.

With higher values for prevalence, such as those suggested by years of experience in a medical outpatient clinic, the arithmetic in this example
would be more encouraging. This problem also appears in other guises. If the false-positive rate is 5% then the specificity is said to be 95%, and that term is equally ambiguous. The statement that a test is 95% specific would suggest to most clinicians that there is a 95% chance that a patient with a positive test has the disease, but epidemiologists now declare that specificity is the proportion of negative results among healthy persons. Clearly, anyone who cites such rates must explain how they were calculated and how they must be interpreted for clinical purposes.

The assessment of diagnostic tests is hampered not only by terminological problems but also by ignorance of the principles of diagnostic decision-making. Lately SACKETT⁵ attempted an analysis of diagnostic reasoning and concluded that clinicians use strategies varying from pattern recognition to the "exhaustive" method, multiple branching, and the hypothetico-deductive method. These strategies, however, may not be mutually exclusive, and other workers take the Popperian view that diagnosis is always hypothetico-deductive. CAMPBELL,⁶ for instance, recommends that medical students are taught "to generate hypotheses quickly and to test them critically rather than wasting time and money collecting information".

Most analyses of diagnostic reasoning assume the existence of an objective diagnostic truth beyond the results of our tests. Clinicians tend to forget that the disease classification is man-made, and that the truth of a diagnosis may be almost a matter of opinion. It would, for instance, make little sense to discuss the specificity of a new test for the diagnosis of systemic lupus erythematosus, as we ourselves determine the specificity. If we decide always to make the diagnosis when the test is positive, then it will become 100% specific (irrespective of the definition of that term), but the reasoning is clearly circular.
Many tests used in the daily routine have probably been introduced in this dubious manner, and once they have been introduced we soon feel that we cannot do without them. Admittedly, S.L.E. is an extreme example, but many other diseases present similar problems from a practical point of view. Common diseases such as myocardial and pulmonary infarction may be fairly defined in terms of morbid anatomy, but when a patient suspected of one of these diseases survives, the diagnosis rests only on the test results. The accuracy of those results cannot be assessed independently.

It has been said that diagnosis is not an end in itself, but only a mental resting place on the way to treatment, and in the final analysis the assessment of diagnostic methods must be based on that point of view. A new method is good if it helps to classify patients in such a way that we obtain better treatment results, and it is superfluous if that is not the case. New diagnostic methods, especially those which are expensive or liable to cause complications, ought to be subjected to controlled trials in order to prove that their routine use will be to patients’ advantage.

5. Sackett, D. L. Clin. invest. Med. 1978, 1, 37. Clinical and Investigative Medicine, a new Canadian journal, is to be published quarterly: annual subscriptions $50 (libraries etc.), $25 (personal), $12.50 (students). Inquiries to Journals Division, Pergamon Press, Headington, Oxford OX3 0BW, U.K.
6. Campbell, E. J. M. Lancet, 1976, i, 134.
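The arithmetic behind the question that CASSCELLS and his co-workers put to doctors and students can be checked with a short calculation. What follows is only a minimal sketch and not part of the original argument: the function name and its parameters are illustrative, the false-positive rate is read in the second of the two senses described earlier (positive results among persons without the disease), and every diseased person is assumed to test positive, as the text itself presumes. The prevalence of 1 in 5 in the second call is an invented figure, chosen only to show the effect of a higher prevalence.

```python
def predictive_value_positive(prevalence, false_positive_rate, sensitivity=1.0):
    """Chance that a person with a positive result actually has the disease.

    false_positive_rate: proportion of positive results among persons
        who do NOT have the disease (the second definition in the text).
    sensitivity: proportion of diseased persons who test positive;
        1.0 is an assumption taken from the text's worked example.
    """
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# The question as posed: prevalence 1/1000, false-positive rate 5 per cent.
rare = predictive_value_positive(prevalence=1 / 1000, false_positive_rate=0.05)
print(f"prevalence 1/1000: {rare:.1%}")    # prints 2.0%

# Same test at a higher, purely illustrative prevalence of 1 in 5.
common = predictive_value_positive(prevalence=0.20, false_positive_rate=0.05)
print(f"prevalence 1/5:    {common:.1%}")  # prints 83.3%
```

The first figure agrees with the "1 of 51" reached by rounding the false positives to 50, and the second suggests why the same test looks far more encouraging at the prevalences met in an outpatient clinic.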
Schools Do Make a Difference
IT is no discredit to Prof. MICHAEL RUTTER and his colleagues that the impact of their book on "secondary schools and their effects on children"¹ owes as much to the fact that it says what people are ready to hear as it does to the evidence which it adduces. What it says is simple and unsurprising: that children’s social and academic development is greatly affected by the character of the school they attend. But it is only a few years since a different message was coming through from American sociologists such as J. S. COLEMAN and SANDY JENCKS. The bowdlerised summary of their work was: "Schools make no difference". COLEMAN showed on the basis of massive surveys of American schools that differences between schools were unimportant by comparison with the differences which the pupils themselves brought to school. JENCKS emphasised the importance of social class and parental attitudes and questioned the value of large-scale programmes aimed at compensating, through education, for social disadvantage. Of course, by the time COLEMAN and JENCKS spelled out their message in the early 1970s, Richard Nixon had succeeded Lyndon Johnson, and there was a reaction against open-handed federal spending on compensatory education programmes. Daniel Moynihan had coined his bon mot about benign neglect.

The evidence collected by the researchers which produced the glib, shorthand conclusion that "schools don’t make any difference" was largely of a kind which was bound to minimise the school contribution and maximise that of the home and social environment. In a valuable introductory chapter to Fifteen Thousand Hours, RUTTER and his colleagues review the previous work in this sphere and the false conclusions which popularisers have drawn. Many of the tests used in the American research depended as little as possible on the subject content of the school curriculum (how could it be otherwise in a vast survey spread over many school systems?). Many of the variables the Americans isolated proved to be irrelevant but measurable; those more relevant were too difficult to measure.

1. Fifteen Thousand Hours. By MICHAEL RUTTER, BARBARA MAUGHAN, PETER MORTIMORE, and JANET OUSTON. London: Open Books. 1979. Pp. 279. £7.50, hardback; £3.50, paperback.

Most important, the basis of the federal programmes to compensate for educational disadvantage had been a quest for a more equal society: COLEMAN found little evidence that
compensatory education could make up for social inequality generally. But this is not the same as showing the schools to be impotent or ineffective with regard to pupils’ learning.

All this seems no more than common sense. But a great deal of interest still surrounds the subject of what schools can and cannot do, and why one school is better (or worse) than another. If differences in the quality of the children’s performance are wholly explained by differences outside the school (such as social class and family attitudes) a debilitating determinism hangs over the whole educational system. The sociologists of education have successfully induced just this sense of powerlessness in many schools, just as, earlier in the century, a generation of psychologists succeeded in proving to teachers that the children under their instruction came with rigid predetermined limitations which could be expressed in terms of innate I.Q. Both these forms of pedagogic Calvinism contain enough truth to shape the climate of educational opinion. Both in their way are a hindrance to that combination of faith and works on which miracles of learning depend.

RUTTER and his team have countered this with a longitudinal study aimed at isolating the effect of the school. They took the 1970 intake of twelve London comprehensive schools and tested, categorised, and generally appraised the pupils at the age of 10 (before they entered the secondary schools), again at the age of 14, and finally at the minimum leaving age of 16. Comparing the assessments at 10 with those at 14, and with examination results at 16, they found that the children in some schools did much better than those in others, and that the differences were not wholly explained by the differences identified at the age of 10. For example, school A received an entry of 65 pupils, 31% of whom had "behavioural difficulties". By the age of 14, this 31% had been reduced to less than 10%. School B, on the other hand, took in 34% of "bad hats" and 3 years later this had risen to 48%: "a five-fold difference between schools". The same kind of evidence was examined for academic performance, yielding the same conclusions about the schools’ direct contribution: the increment in some schools seems to be very good; in others, very poor, and this did not correspond closely to the early measure of verbal reasoning.

The research design provided for a great deal of information to be collected by questionnaire, interview, and observation about the schools themselves. This brought out details of academic organisation and practice: about the amount of teaching and homework, the zeal or slackness of the staff, the use of the library, the attitude of the head to his pastoral responsibilities, and about discipline.