THE RANZCP MEMBERSHIP EXAMINATION: A REVIEW OF THE RELEVANT LITERATURE

John Condon

The design of any assessment system, as well as the way it is scored, must be intimately linked to the purpose it is to serve. Previous literature on the RANZCP membership examination has tended to focus exclusively on issues of reliability and validity at the expense of broader concerns. The present paper selectively reviews the literature on assessment in education, highlighting the differences between normative and criterion referencing and their implications for the profession. Published studies of the reliability of post-graduate written and oral examinations are reviewed, focussing particularly on those conducted in psychiatric settings. Finally, the arguments for and against continuous assessment are summarised. Greater familiarity with the literature may result in more informed and constructive debate about the assessment process than has hitherto been evident. Australian and New Zealand Journal of Psychiatry 1991; 25:383-391

Department of Psychiatry, Repatriation General Hospital, Daw Park, SA
John Condon MB, BS, Dip. Psychother. (Adelaide), FRANZCP, Senior Staff Specialist

For many members of the College, the RANZCP membership examination has been a source of stress and often distress. It is frequently experienced by trainees as an onerous and painful potential impediment to progress along their chosen career path. Criticism has been levelled at both its reliability and its validity, which has led to doubts in some quarters as to its appropriateness as a measure of clinical competence. Such doubts potentially exacerbate the fear and anger which characterise many trainees' "relationship" with the examination system. It is unfortunate that almost all discussion of the membership examination begins and ends with consideration of its reliability and validity (especially as regards the viva component). Clearly, such issues are of great importance. However, exclusive preoccupation with them diverts attention away from a number of broader concerns. Both the design of any examination, as well as the way it is scored, must be intimately linked to the purpose it is to serve. The latter, in turn, must be geared to some concept of "standard". Such fundamental issues warrant far more discussion than is usually accorded them. In this paper, I have attempted to create a broad context or background against which the RANZCP examination can be viewed. This context derives from the literature on assessment in education in general, and assessment in post-graduate education in particular. Thereby, on-going debate about the examination may become better informed and more fruitful than has hitherto been the case. My purpose is to highlight issues and raise questions, not to provide my personal impressions or solutions, nor to compose a defence of the present system. I have a bias to the extent that I believe the current system is imperfect, yet metes out vastly more justice than injustice. To the extent that injustice does occur, the onus is upon all Fellows of the College to suggest specific ways in which its occurrence may be minimised. This is an extraordinarily difficult task, in contrast to the extraordinarily easy one of simply highlighting the system's obvious deficiencies. Finally, I would make clear that this paper reflects my own thinking and should not be taken to represent that of the Committee for Examinations, of which I am a member.

What is the examination designed to assess?

The purpose of the examination is to ensure that those obtaining College Membership have a level of knowledge and skills commensurate with those of an adequate, not necessarily exemplary, consultant psychiatrist. In the medical literature, there is increasing realisation that examinations are not merely a means of quality control, but can potentially end up directing and re-shaping the very attributes which they attempt to measure. Thus, unless care is taken, there is a danger that the tail (examination) begins to wag the dog (professional competence) [1]. There is something inherently disquieting about the notion of a psychiatrist designed by a committee.

Although the medical literature makes much of the distinction between knowledge and skills, defining the interface between the two is far from straightforward. Even in examinations specifically designed to assess "skills", closer scrutiny reveals that a majority of the items actually involve simple recall of knowledge [2]. In psychiatry, the notion of "skill" is intimately related to entities such as empathy and judgement which, being highly abstract, are far from easy to measure. It is well established that, in all branches of medicine, reliability and validity diminish as a test moves from knowledge towards clinical competence [1].

The problems inherent in assessing clinical competence begin with the difficulties in reaching any consensus regarding its definition. A reasonably comprehensive knowledge of the discipline is an obvious prerequisite. In psychiatry, the ability to take a history and perform a sound mental state examination leading to a plausible formulation, a sensible diagnostic statement and a reasonable management plan are also necessary ingredients. However, it could be argued that these latter are, in a sense, epiphenomena which reflect abilities along a number of other underlying dimensions of clinical competence. Kalucy [3] has described several of these, including the ability to assign appropriate priorities to the various data. This is essentially a decision-making process involving filtering out irrelevancies and judging the clinical significance and implications of what remains. Kalucy [3] believes that the process of formulation involves the candidate's making "a plausible set of guesses" regarding "how this patient happens to be in the dilemma they are in, at this point in time and also to be able to see what it might mean if they (the candidate) are wrong". He stresses the need to view the patient and their problems against the broader context of the sociocultural environment. Kalucy [3] believes that inherent in clinical competence is another aspect of "judgement", in which the candidate recognises the potential limitations of the data and can tolerate uncertainty, as evidenced by resisting the temptation to cling rigidly and defensively to one particular narrow view of the patient's difficulty.

Langsley and Yager [4] in 1988 surveyed 223 American psychiatrists regarding what they considered the most important skills for a consultant psychiatrist. The top six (of 48) skills, in decreasing order of importance, were: 1) the ability to conduct a comprehensive diagnostic interview; 2) the ability to recognise counter-transference; 3) the making of an accurate diagnosis; 4) the ability to evaluate the need for hospitalisation; 5) a capacity to demonstrate interest, tact and compassion for patient and family; and 6) the ability to assess suicidal and homicidal potential. The authors believed that the appropriate way of assessing such skills involved "more frequent and on-going assessments" in the clinical setting.

Norman [5] has provided a comprehensive overview of the various difficulties inherent in the assessment of clinical competence in medical settings. At least four observations are of potential relevance to psychiatry. First, many authorities believe that to assess skill we must assess "how candidates think". In this regard, there is some agreement in the literature that one crucial aspect of clinical skill is the ability to form a multitude of hypotheses early in the clinical encounter and then to carry out a process of data gathering specifically geared to confirming or refuting them [6]. Such an approach characterises the "expert", in contrast to the "novice", who attempts to gather all relevant information and only then shifts to a hypothesis generation/testing mode. Second, Norman points out that the notion of "clinical competence" as a construct presupposes that the various types of knowledge and skills are intercorrelated, i.e. a "good" candidate will be "good" at everything. There is little evidence in the medical literature to support or refute this assumption. Much of the agonising which occurs at medical examiners' meetings arises less often because a candidate is borderline on all attributes than because his or her performance is inconsistent across the range of attributes. Thus, some medical educationalists argue that a major source of unreliability is the futile attempt to lock into one grade (pass/fail) a range of abilities which are poorly intercorrelated [1]. For example, in one large American study of intern competence, thoroughness of laboratory work-up at admission was found to be inversely related to quality of care [7]. Third, Norman highlights the finding in many studies that there is a "great discrepancy" between knowledge and the effective application of that knowledge in clinical settings (i.e. skill). He concludes that "studies of actual performance have a clear role to play in defining clinical competence". Fourth, it would seem that experienced consultants frequently do not practise medicine in the manner in which they were taught in medical school or during postgraduate training. In assessment, there is therefore a risk of confusion by double standard: is the yardstick to be how they actually practise or how they believe they ought to practise?

Finally, Norman provides an overview of various approaches to defining "competence" in clinical medicine. These range from considerations deemed important by recognised authorities, through more systematic methods geared to analysing the abilities and attributes which consultant physicians may have in common, to perspectives based on patient satisfaction. He concludes that all such methods tend to result in a "statement of the ideal", and that, although there is agreement as to what constitutes very good and very poor practice, it is extremely difficult to interpolate a standard of "adequate" practice between these two poles.

Adequate compared to what?

In the jargon of education, our examination as it currently stands is a "mastery test", or rather a series of "mastery tests" [8,9]. Although few would argue that our standard is not "high", we are nevertheless defining a minimum acceptable level of performance for each component. This standard is set by the Committee for Examinations and it is regarded as an absolute standard agreed in advance. A candidate's status is determined by his or her score, irrespective of the scores of other candidates. Those who satisfy the minimum standard on all components are deemed to have achieved "mastery". The results are dichotomised as pass/fail and merit grades are not published. Finally, it is theoretically possible for all candidates to pass. Examinations of this type are referred to as "criterion-referenced" [10].

In contrast, there are many examinations (including the Higher School Certificate-determined entry to a variety of courses) where the objective is to select the "best" candidates to fill a predetermined number of places or vacancies. The aim is to "cream off" candidates scoring above a pre-determined percentile (or, conversely, to fail a predetermined percentage of candidates). In this setting, an individual candidate's raw score has no meaning in its own right, but only in relation to the scores of other candidates sitting the examination, i.e. it is the candidate's ranking which counts. "Mastery" does not guarantee success and there may be no absolute performance criteria. The "standard", in so far as there is one, is really determined by the candidates themselves. Examinations of this type are referred to as "norm-referenced" [8]. The culling which occurs in some states in selection for entry to the training programme, although not strictly an examination, is very much an exercise of this type, in that success ultimately depends on the number of posts available and who else happens to apply. Thus, our entry examination is frequently of a "creaming off" type and hence utilises norm-referencing. In contrast, our exit examination is of a "mastery" type and should utilise criterion-referencing.

Educationalists believe that, in any examination system, if the distinction between normative and criterion-referencing becomes blurred, confusion inevitably arises [11,12]. For example, if our written examination is criterion-referenced and 90% of candidates fail a particular question according to predefined criteria, then provided the question is considered a legitimate one, no further action would ensue. In contrast, under norm-referencing, the question would be designated as having been "too hard" and the pass mark shifted accordingly.

The distinction between normative and criterion-referencing has far broader implications, of which I will mention only four: 1) Ellard [13] has put forward the view that the College, in its approach to selecting new members, is best served "by pursuing a policy of unremitting elitism, discarding all but the best". If we were to adopt this view, and if we believed our exit examination provides a quantitative measure of the range of abilities required for competent practice as a consultant psychiatrist, then logically the examination should be converted to norm-referencing. 2) Criterion-referencing results in the number of new consultant psychiatrists appearing each year being directly proportional to the number of trainee positions which were filled five years previously. Whether or not there is any logical rationale for gearing the nation's supply of consultants to the number of training posts remains moot. Making the criteria "tougher" may have little effect on the numbers passing, since candidates may simply work harder to achieve the higher standard. In contrast, in norm-referencing, shifting the cut-off point has a direct and immediate effect on the output rate. 3) If relative proficiency, together with "market forces", determined the award of College Membership, this may well be regarded by many as draconian. Unhealthy competitiveness may ensue among trainees. Nevertheless, in the British system, although the award of membership is not determined in this way, promotion to Senior Registrar or Consultant status is very much done by a "creaming off" process. 4) One of the potential problems with criterion-referencing is that mastery is often a function of time. Thus, if a candidate sits the examination a sufficient number of times, eventually "mastery" of the examination is likely. Such expertise in examination technique may not reflect a corresponding clinical expertise. In addition, scores in any assessment system will always contain a component of random error variance. Millman [14] has described, in detail, how repeated sitting of an examination may lead to success by chance alone. Millman argues that, in medical examinations, passing incompetent candidates is a more serious error than failing competent ones. Consequently, he suggests that the pass standard should be successively increased as a function of the number of attempts. He argues that this is no more "unfair" than being subjected to increased car insurance premiums following any vehicle accident. The analogy is perhaps a dubious one. Mehrens & Lehmann [10] suggest that chance success following multiple attempts is much less likely in norm-referenced examinations, since less able candidates much more consistently precipitate out at the bottom.
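Millman's point about repeated attempts can be illustrated with a simple worked calculation (the figures here are hypothetical and are not drawn from his paper). If a genuinely below-standard candidate has a probability p of being passed at any single sitting purely through error variance, and sittings are treated as independent, the probability of at least one spurious pass over n attempts is

\[ P(\text{spurious pass within } n \text{ attempts}) = 1 - (1 - p)^{n} \]

With p = 0.10, for example, this rises from 0.10 after one attempt to roughly 0.19, 0.27 and 0.34 after two, three and four attempts respectively. The independence assumption is itself debatable, but the calculation conveys the direction of the effect and the rationale for Millman's suggestion of an escalating pass standard.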

Validity of the membership examination

To the extent that all segments of the examination appear to have relevance to the practice of psychiatry, the examination could be said to have face validity. As regards content validity, it could be argued that the examination fails to assess adequately the breadth of functions required of a consultant psychiatrist. For example, the qualities required to create a relationship with a patient characterised by empathy, rapport and respect are not assessed to any significant degree. Likewise, the interpersonal skills required to function as a member of a team with medical and non-medical professionals are omitted from consideration. Supervisor reports (which do not currently form any part of the assessment system) may be of use in this regard. The issue of validity in any specialist accreditation examination ultimately hinges upon whether the knowledge and skills as assessed in the artificial examination system are reflected in subsequent competence evidenced by the consultant in a private, unobserved clinical setting. This, of course, is the issue of predictive validity, and MEDLARS searching revealed only one study of the predictive power of postgraduate accreditation examinations. This was a study of ninety-two U.S. consultant anaesthetists, all of whom had passed Board Membership despite having given at least four "highly dangerous" answers in the final MCQ. The hypothesis under test was that they constituted a "smart but dangerous" sub-group. At follow-up, no evidence of increased dangerousness was detected [15].

Reliability of written and oral examinations

Written examinations

The majority of Colleges throughout the Western world (including the U.S. and Canadian psychiatric Colleges) abandoned the use of essay examinations in favour of Multiple Choice Questionnaires (MCQs), on average about ten years ago. It would be erroneous to suppose that this resulted only from reliability concerns. Much of the impetus originated from the difficulty of finding highly qualified examiners prepared to mark large numbers of essays, and from the high administrative cost of essay examinations as compared with MCQs [5].

As regards MCQs in psychiatry, Greenblatt et al [16], in a study of 200 U.S. trainees, found that the multiple choice questionnaire had considerable predictive power in terms of success in the subsequent orals. Thus, 88% of those who passed the written examination on their first attempt obtained membership within the next four years, whereas only 12% of those who failed the written examination did so. Such impressive predictive power must, however, be considered in the light of the knowledge that MCQs are invariably norm-referenced, usually with cut-off points which result in 40-50% of candidates automatically failing regardless of their absolute performance. Thus, the U.S. and Canadian writtens are very much a "creaming off" operation.

The most important single observation which emerges from the literature on the reliability of essay examinations is the diverse and conflicting nature of the research findings. It is equally easy to find whole sets of studies demonstrating inter-rater reliabilities in excess of 0.8 or below 0.2 [5]. Despite disagreement on absolute reliability, the literature is much clearer in identifying factors which impinge upon such reliability. Four considerations emerge consistently: 1) Norm-referencing carries a significantly higher level of reliability in essay marking than criterion-referencing [11]. Thus, examiners seem much more consistent in ranking candidates in order of merit than in deciding whether candidates meet some external performance criterion. 2) In the older studies of the reliability of medical essay marking, inter-rater reliability was typically between 0.2 and 0.3 for long essay answers and approximately 0.5 for short answers [5, 17]. Further studies have demonstrated that, if criterion-referencing is utilised, acceptable reliability requires two conditions to be satisfied [18]. First, the examiners must be experts; second, there must be a detailed, explicit scoring system which is rigorously applied. Under these circumstances, inter-rater reliability can be increased to 0.7-0.8, the higher figure applying to short answer questions. If either of these criteria is not met, inter-rater reliability falls to the above-mentioned levels and pass/fail gradings do not achieve agreement greater than that expected through chance alone [17]. 3) The technique of having a second examiner re-mark essays which have been failed by a previous examiner is considered dubious [17]. Even if the second examiner is blind to the first examiner's assessment, he or she is inevitably aware of the reason for the re-mark. Almost invariably the two examiners will agree at a level which is grossly inflated above that which has ever been achieved between truly blind raters. 4) Although there is little doubt that inter-rater reliability can be improved through the use of explicit scoring systems, it has been suggested that leniency (or strictness) is largely a character trait upon which training has little if any impact [19].

There is a viewpoint that the essay examination may be especially worthwhile in psychiatry as opposed to other specialties. The exponents of this view point out the relevance of verbal skills and of the abilities to formulate hypotheses, assess probabilities and arrive at judgements, often in the face of considerable uncertainty. Although this argument has some face validity, at present it is bereft of empirical support.

Viva examinations

Many of the observations made above regarding reliability research on essay examinations are equally applicable to oral examinations. Muzzin and Hart [20] point out that there are at least twenty studies in the literature exploring the measurement properties of oral examinations in medical settings. All too frequently, writers (for example, Minas & McGorry [21]) select a few studies from this pool to support their particular line of argument. It is important to recognise that, even in psychiatric settings, viva examinations differ considerably in their style and format. For example, in the Canadian system, only one viva is administered and the candidate's interview with the patient is observed and forms part of the assessment. In the U.S. system, in addition to a viva following a "live patient" interview, candidates are given a second viva after observing a video of a patient interview. There are many other variations, such as whether each member of the pair of examiners reaches a verdict independently or whether, via discussion, they strive for consensus. Thus, considerable caution is warranted in generalising the findings from overseas studies to our own viva examinations.

Muzzin & Hart [20] have comprehensively reviewed the literature on oral examinations in medical fields. Again, the point is made that the huge administrative cost of such examinations has been a major factor in their less frequent use in the U.S., quite aside from concerns relating to reliability. These reviewers conclude: "The longevity of the oral examination in medical evaluation cannot entirely be attributed to the stubborn and illogical persistence of a culturally established tradition: despite its short-comings, the oral may be as good (or bad) a method as any other to assess clinical competence."

Views on oral examinations range from their being construed as primitive rites of passage to membership of "The Club", to their providing valid and reliable measures of clinical competence not assessable by other means. The intensity of feelings generated by viva examinations has been noted for centuries. Candidates in medieval universities were required to take an oath that they would not "take vengeance on the examiner with knife or other sharp instrument" [22]. Whether such passion is a direct result of the face-to-face nature of the encounter (which somehow gives the impression that assessment is geared to what the candidate is rather than what he/she does) or whether it arises from concerns about reliability is unclear. Potential sources of unreliability include failure to take adequate account of patient difficulty; variation between examiners in their interpretation of the standard; inconsistent coverage of clinical areas; candidate anxiety; personality variables; halo effects (in which scores in one single area are inadvertently generalised to other areas); and a variety of situational factors.

The Committee for Examinations attempts to maintain a satisfactory level of reliability in the orals by four means:
1. Examination by pairs of examiners (one of whom is always a member of the Committee for Examinations). Muzzin & Hart [20] stress the importance of paired examiners in enhancing reliability.
2. Detailed discussion of all borderline and failed candidates at the subsequent examiners' meeting.
3. Providing examiners with a patient summary, as well as having every patient briefly interviewed by both examiners to enable assessment of the accuracy of the summary and an estimate of patient difficulty.

4. Regular training sessions for examiners to assist in clearly defining a standard.

In the field of psychiatry, the reliability of the Canadian viva examinations has been the focus of more empirical study than any other [23-27]. In evaluating the Canadian research, it is crucially important to recognise that most of these studies have utilised situations in which examiners view a video tape of a viva. Thus, the raters are passive observers of two of their colleagues examining a trainee. This is a highly artificial situation, in that the observing examiners have no opportunity to test out their own hypotheses about the candidate's strengths and weaknesses by questioning him or her, as would be the case in a real examination.

In a much quoted study of this type, Leichner et al [23] showed a video tape of a Canadian candidate's interview and viva to 51 examiners (approximately one-third academics, one-third non-academics and one-third trainees). Examiners independently decided ratings of pass, fail or borderline. Of 105 possible pairings, agreement within pairs occurred in 42%. Chance agreement would occur in 33%, and this level of agreement is statistically no better than chance. Pass/fail disagreement occurred in 31% of academic pairs, 54% of non-academic pairs and 27% of trainee pairs. In a subsequent paper [24] on the same study, these investigators examined the relative contributions of each component of the examination to the variance of the 15 academic examiners' marks. The candidate's performance in the viva accounted for the greatest portion of the variance, whereas performance during the interview contributed little. However, the examiners' impression of the degree of difficulty of the case (formed during observation of the interview) accounted for a substantial portion of the outcome variance. Thus, it would seem that when the examiners observe the candidate's interview, they allocate much more weight to the degree of difficulty of the patient than to the candidate's interview skills.

Using similar methodology, McCormick [25] showed videos of three Canadian candidates' performance on the interview and oral examination to three separate groups of psychiatrist examiners, each at three different centres. Overall, there was very poor inter-rater agreement regarding pass/fail on two of the three candidates. However, within any one centre agreement was better, despite there being no overt discussion between examiners. The author believes that non-verbal communication took place between the examiners, accounting for the "within group" agreement but "between group" disagreement. He refers to this as a "group contamination effect". In discussion, he raises the issue of whether it is "a comfort" to know that one examiner can sway the other so that the ultimate mark "seems agreed-on".

Lowy & Prosen [27] report on an investigation of the Canadian College's attempts to train 27 examiners, using video-taped interviews, to enhance inter-rater reliability in the orals. The results are not reported in detail but were described as "impressive", in that "26 of the 27 agreed with regard to passing or failing a candidate" after training. In a separate exercise, the inter-rater agreement within pairs of examiners actually examining in the real examination was found to be of the order of 0.8 in each of five areas, including mental state examination (MSE) and clinical competence. It is unfortunate that Lowy & Prosen [27] do not report their data in more detail. The paper is important since it is the only study reporting inter-rater agreement levels between examiners participating in the real examination, as opposed to observing video-taped interviews. The very much more respectable inter-rater agreement suggests that the video exercise (albeit a most valuable training aid) results in spuriously low inter-rater reliability, owing to the examiners' non-participation and hence their inability to verify their impressions.

In interpreting studies of this kind, it is important to distinguish inter-rater agreement from inter-rater reliability. "Agreement" refers to the extent to which examiners' absolute scores agree. "Reliability" is a quite different notion and relates directly to correlation (i.e. do the gradings of one examiner correlate with those of another?). In theory, it would be possible that if one examiner failed all candidates and his co-examiner passed all the same candidates, inter-rater agreement would be zero, but inter-rater reliability could be perfect if the examiners agreed on the ranking of the candidates within the two (pass and fail) subgroups. In addition, in quoting agreement or reliability statistics, allowance must always be made for chance agreement. Tinsley & Weiss [28] have comprehensively reviewed statistical approaches to inter-rater reliability. They point out that the use of simple correlation techniques can be quite misleading.

Ross and Leichner [26] asked 26 Canadian trainees to provide a written MSE assessment of a video-taped patient interview. They then compared two examiners' assessments of this written report with the candidates' scores on the MSE component of the final viva. They found a high level of "agreement" between trainees' MSE performance on the video and performance in the final examination. However, the statistics utilised (Pearson correlations) are inappropriate and this conclusion probably cannot legitimately be drawn. Tardiff [29] demonstrated that undergraduate medical students' ability to identify psychopathology on a video was not related to scores on psychiatry MCQs, essays or supervisors' reports. He concludes that the skill of identifying psychopathology can only be evaluated in real or video-taped clinical settings, and that the other techniques are largely limited to testing knowledge. Talbott [30] compared the examination results of 2,236 candidates on the live interview and viva with those on a simulated interview and viva. Pass/fail concordance between the two examinations was 82%; however, candidates did consistently better on the simulated examination, and those who failed tended to do so on the live interview. The author concludes that the live interview is more discriminating and should not be abolished in favour of video techniques.
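To make concrete the distinction drawn above between agreement and reliability, and the need to allow for chance agreement, the following sketch computes raw agreement, chance-corrected agreement (Cohen's kappa) and a Pearson correlation for two hypothetical examiners grading the same ten candidates on a three-point scale. The marks are invented for illustration and are not drawn from any of the studies cited; only plain Python is assumed.

```python
# Illustrative only: two hypothetical examiners rating ten candidates
# on a three-point scale (0 = fail, 1 = borderline, 2 = pass).
from collections import Counter
from math import sqrt

examiner_a = [0, 1, 2, 2, 1, 0, 2, 1, 2, 0]
examiner_b = [1, 2, 2, 2, 1, 1, 2, 2, 2, 1]   # systematically more lenient

n = len(examiner_a)

# Raw (observed) agreement: proportion of candidates given the same grade.
observed = sum(a == b for a, b in zip(examiner_a, examiner_b)) / n

# Chance agreement: probability of agreeing if each examiner graded at
# random according to his or her own marginal distribution of grades.
pa, pb = Counter(examiner_a), Counter(examiner_b)
chance = sum((pa[g] / n) * (pb[g] / n) for g in set(examiner_a) | set(examiner_b))

# Cohen's kappa: agreement in excess of chance, scaled to a maximum of 1.
kappa = (observed - chance) / (1 - chance)

# Pearson correlation: do the two examiners rank candidates similarly,
# even if their absolute grades differ?
def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"Observed agreement: {observed:.2f}")   # 0.50
print(f"Chance agreement:   {chance:.2f}")     # 0.36
print(f"Cohen's kappa:      {kappa:.2f}")      # about 0.22
print(f"Correlation:        {pearson(examiner_a, examiner_b):.2f}")  # about 0.84
```

With these invented marks the examiners agree on only half the candidates, and kappa is about 0.2, yet their marks correlate at roughly 0.84 because the more lenient examiner ranks the candidates in much the same order. This is precisely the situation in which agreement statistics and reliability statistics tell different stories.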

Continuous assessment

It is apparent that there is a ubiquitous trend, at all levels of education, away from the "final" examination being the sole criterion and towards continuous assessment. The literature on post-graduate accreditation focuses mainly on two approaches to continuous assessment.

Supervisor reports

At present, supervisor reports play no role in the RANZCP assessment system, apart from providing documentation that the trainee has satisfactorily completed the training requirements. In contrast, In-Training Evaluation Reports (ITERs) are an important component of the Canadian assessment system, especially to help decide borderline pass/fail dilemmas. The Canadians argue that individual and clinical supervisors are ideally placed to observe the trainee's performance in day-to-day clinical settings. Their assessments (based on such observations) should have, as Small [31] put it, "compelling face validity".

Two main arguments have been advanced in the literature against the use of supervisor reports. First, it has been suggested that many supervisors may baulk at providing negative criticism of trainees, especially where the trainee (who is legally entitled to view the report) is involved in an on-going work or social relationship with the supervisor. The second argument has been summarised by Klein and Babineau [32], who believe that the unconscious needs of both supervisor and trainee may powerfully contaminate objectivity. These authors believe that, in psychiatry, one cannot isolate the trainee's performance from his or her personal characteristics and hence any evaluation is, to some extent, a "character analysis". They believe trainees have a very limited capacity to "metabolize criticism" of what they are (as opposed to what they do). The Canadian experience suggests that supervisor reports can provide a meaningful assessment of clinical competence, provided that the dimensions of the assessment are clearly pre-defined on a standardised form. It would seem that free-format assessments in the style of "character references" have no objective validity whatever.

Interim examinations

Many Colleges require trainees to undergo some type of formal assessment of both knowledge and skills at the end of their first year of training. Such assessments do not "count" towards the final examination, since the standard required in the latter is obviously higher. Rather, their main purpose is to alert the trainee, at an earlier point in training, to unsatisfactory aspects of performance. In addition, the identification of trainees who are clearly unsuited to continue training is facilitated. These Colleges argue that it is in the best interests of all concerned that this should take place after one rather than three (or often more) years. Obviously, the same considerations regarding reliability and validity discussed above arise here, as does the question of whether such an assessment should be norm- or criterion-referenced.

Conclusions

The final RANZCP examination will probably remain criterion-referenced. Those who argue that it is draconian to fail a trainee (after three years of training) who meets the standard, simply because other trainees exceed it, will probably hold sway over those who propose tackling the manpower issue by "creaming off" in the final examination. In this regard, Australia will remain "out of step" with the U.S. and Canada, where the MCQ clearly screens out a predetermined percentage of candidates prior to the orals. With regard to examinations held earlier in training, the arguments for norm-referencing may be more tenable.

The present examination format has been in use for over two decades. The recent announcement that the relevant by-laws are to be reviewed provides an opportunity for the College to examine the assessment system in the light of current literature on medical and non-medical education. The existing literature, as reviewed in this paper, cannot provide definitive answers to most of the crucial questions raised in attempting to assess clinical competence; nor can it provide a blueprint for an "ideal" examination. It can, however, provide a background for more informed debate about the assessment process and highlight the need for on-going scrutiny of, and research into, this process. The onus is, I believe, upon the College Membership as a whole to critique the examination process constructively. The assessment process is central to the maintenance of practice standards in Australian and New Zealand psychiatry, and any major deficiencies in it potentially have far-reaching implications for the profession as a whole.

Acknowledgements

The author wishes to acknowledge the helpful critique of this paper by Prof Bruce Singh.

References

1. Small SM, Regan PF. An evaluation of evaluations. American Journal of Psychiatry 1974; 131:51-55.
2. Jayawickramarajah P. Oral examinations in medical education. Medical Education 1985; 19:290-293.
3. Kalucy RS. The examiner's task: RANZCP clinical examinations. Unpublished report to the Committee for Examinations, 1987.
4. Langsley DG, Yager J. The definition of a psychiatrist: Eight years on. American Journal of Psychiatry 1988; 145:469-475.
5. Norman GR. Defining competence: A methodological review. In: Neufeld VR, Norman GR (eds). Assessing Clinical Competence. New York: Springer, 1985.
6. DeGraff E, Post GJ, Drop MJ. Validation of a new method of clinical problem solving. Medical Education 1987; 21:213-218.
7. Sackett DL. Clinical diagnosis and the clinical laboratory. Clinical Investigations in Medicine 1978; 1:37-43.
8. Thyne JM. Principles of Examining. London: University of London, 1974.
9. Payne DA. The Assessment of Learning. Lexington: Heath, 1974.


10. Mehrens WA, Lehmann IJ. Measurement and Evaluation in Education and Psychology. New York: Holt, Rinehart & Winston, 1984.
11. Theobald J. Classroom Testing. Melbourne: Longman, 1974.
12. Lindvall CM, Nitko AJ. Measuring Pupil Achievement and Aptitude. New York: Harcourt Brace Jovanovich, 1975.
13. Ellard J. Where we are and where we are going. Australian and New Zealand Journal of Psychiatry 1988; 22:258-263.
14. Millman J. If at first you don't succeed. Educational Researcher 1988; 18:5-9.
15. Slogoff S, Hughes FP. Validity of scoring "dangerous answers" on a written certification examination. Journal of Medical Education 1987; 62:625-631.
16. Greenblatt M, Carew J, Pierce C. Success rates in Psychiatry and Neurology certification examinations. American Journal of Psychiatry 1977; 134:1259-1261.
17. Street MR. The reliability of expert judgement in the marking of State nursing examinations. Journal of Advanced Nursing; 8:221-226.
18. Nichols EG, Miller GK. Inter-reader agreement in comprehensive essay examinations. Journal of Nursing Education 1984; 23:64-69.
19. McLeod PJ. The impact of educational interventions on the reliability of teacher's assessment of student care reports: A controlled trial. Medical Education 1988; 22:113-117.
20. Muzzin LJ, Hart L. Oral examinations. In: Neufeld VR, Norman GR (eds). Assessing Clinical Competence. New York: Springer, 1985.
21. Minas IH, McGorry PD. The RANZCP vivas: A suitable case for examination. Australian and New Zealand Journal of Psychiatry 1988; 22:432-435.


22. Lasagna L. The mind and morality of the doctor. Yale Journal of Biological Medicine 1965; 37:345-348.
23. Leichner P, Sisler GC, Harper D. A study of the reliability of the clinical oral examination in psychiatry. Canadian Journal of Psychiatry 1984; 29:394-397.
24. Leichner P, Sisler GC, Harper D. The clinical oral examination in psychiatry: Association between sub-scoring and global marks. Canadian Journal of Psychiatry 1986; 31:750-751.
25. McCormick WO. A practice oral examination rating scale: Inter-observer reliability. Canadian Journal of Psychiatry 1981; 26:236-239.
26. Ross CA, Leichner P. Residents' performance on the mental status examination. Canadian Journal of Psychiatry 1988; 33:108-111.
27. Lowy FH, Prosen H. The Canadian certification examination in psychiatry. Canadian Journal of Psychiatry 1979; 24:292-301.
28. Tinsley HE, Weiss DJ. Interrater reliability and agreement of subjective judgements. Journal of Counseling Psychology 1975; 22:358-376.
29. Tardiff K. A videotape technique for measuring clinical skills: Three years of experience. Journal of Medical Education 1981; 56:187-191.
30. Talbott JA. Is the "live patient" interview on the boards necessary? American Journal of Psychiatry 1983; 140:890-893.
31. Small SM. Limitations and values of evaluation techniques in psychiatric education. American Journal of Psychiatry 1975; 132:52-55.
32. Klein RH, Babineau R. Evaluating the competence of trainees. American Journal of Psychiatry 1974; 131:788-791.

