© 2002 Martin Dunitz Ltd
International Journal of Psychiatry in Clinical Practice 2002 Volume 6 Pages 73–81
Risk assessment for people with mental health problems: a pilot study of reliability in working practice

TM GALE,1,2 A WOODWARD,3 CJ HAWLEY,1,2 J HAYES,2 T SIVAKUMARAN1 AND G HANSEN2
1Department of Psychiatry, QEII Hospital, Welwyn Garden City; 2University of Hertfordshire; and 3Hertfordshire County Council Adult Care Services Department, Welwyn Garden City, UK
INTRODUCTION: This paper describes a pilot study of reliability in the risk assessment of people with mental health problems. Specifically, we explore the evidence for professional and gender bias in ratings, in addition to the general level of agreement between raters.

METHOD: Six professional groups (psychiatrists, junior psychiatric doctors, nurses, community psychiatric nurses, social workers and occupational therapists) participated in the study and rated 159 patients on a nine-item scale which assessed different components of risk.

RESULTS: Contrary to some earlier work, we found no clear evidence that any one group consistently rated more extremely than any other group. Women were more cautious than men in their ratings, and this concurs with previous studies. Finally, a reliability study of randomly selected pairs of raters showed only moderate levels of agreement and, in some instances, the levels of disagreement were high enough to warrant concern.

CONCLUSION: These findings are discussed in the context of current risk assessment practice and the problems associated with investigating reliability in naturalistic settings and designing appropriate rating tools for risk. (Int J Psych Clin Pract 2002; 6: 73–81)

Correspondence Address: Dr Tim M Gale, Division of Psychology, University of Hertfordshire, College Lane, Hatfield, Herts, AL10 9AB, UK. Tel: 01707 365383. Fax: 01707 335168. E-mail: [email protected]

Received 20 September 2001; revised 22 October 2001; accepted for publication 23 October 2001.
Keywords: risk assessment, profession, gender, reliability, field study
INTRODUCTION
Risk assessment can be defined as "the qualitative or quantitative estimation of the likelihood of adverse effects that may result from exposure to specified health hazards or from the absence of beneficial influences".1 Recent changes in service provision for people with severe and enduring mental health problems, along with some highly publicized homicides, have raised concerns about the dangers of psychiatric patients being treated in the community and this has, in turn, emphasized the importance of risk assessment for such individuals. Indeed, recent legislation and guidelines have reinforced the importance of thorough risk assessment and risk management within the Care Programme Approach.2–4 Although risk assessment is now a routine part of mental health practice, the concept of 'risk' itself remains
problematic because it relates to such a potentially diverse set of outcomes. For example, Ryan points out that while many workers in the mental health field conceptualize risk in terms of actual physical harm, psychosocial harm (e.g. loss of reputation, dignity or even accommodation) is a much more frequent outcome.5 Moreover, although individual health care providers and Social Services departments are required to routinely assess risk there is, as yet, no standard format for risk assessment. Community Mental Health Teams usually design their own assessment tools according to local needs, although our preliminary assessment shows that the majority of these aim to capture information about the same general areas of risk (i.e. risk to self, risk to others, patient vulnerability). Patients are usually assessed under a number of key issues and subsequently graded according to some 'global' status (e.g. 'high', 'moderate', 'low' or 'no' risk). It is not clear
TM Gale et al
whether such descriptors are systematically derived from the individual risk items using some kind of cumulative algorithm (as would be used, for example, for many psychiatric rating scales), or whether the global judgement reflects a more holistic or qualitative approach by the assessor. Indeed, the fact that no standard scale has been consistently used to assess risk makes this a difficult question to address. Moreover, different assessors may ascribe different weights to different components of risk, and it is almost certainly the case that the response of one individual to a risk assessment exercise will be coloured by their own experience rather than derived from epidemiological evidence about risk and outcomes.5 These issues raise important questions about the reliability and validity of risk assessment for people with mental health problems. If a rating tool is reliable, two raters assessing the same patient will agree with each other and, if the tool is also valid, one can have confidence that it accurately reflects what it purports to measure. But if risk assessment involves the use of different rating tools and, indeed, different criteria, how may reliability and validity be properly assessed? Perhaps the limited coverage of these topics in the psychiatric, social services and health services literature attests to the difficulties associated with researching these fundamental issues. Some researchers have previously investigated inter-rater reliability for specific components of risk (as opposed to more global/abstracted descriptors). For example, Montandon and Harding asked 193 raters to assess the 'dangerousness' of 16 individuals on a four-point scale ranging from 0 (no dangerousness) to 3 (extreme dangerousness).6 The case reports used in their study comprised individuals who were: 1. both violent and mentally ill (n=4); 2. violent but not mentally ill (n=4); 3. not violent but mentally ill (n=5); and 4. neither violent nor mentally ill (n=3). The level of agreement between the raters was generally found to be quite poor (range 29–86%), with a level of 60% being reached in only one quarter of cases; at worst, agreement only marginally exceeded chance levels (29% vs. 25%). Moreover, there was a considerable difference between professions in the extent of dangerousness ascribed to an individual: psychiatrists were more cautious than non-psychiatric doctors, penal justice professionals and other professional groups. The study illustrates, at least with respect to one particular component of risk, that different people may have very different perceptions and, moreover, that this may be partly determined by their professional background. The results also concur with those of earlier work which challenged the assumption that assessments of violence and dangerous behaviour are consistent between raters.7–9 Thus, even when risk is narrowly defined (i.e. by considering one specific characteristic), the evidence suggests that agreement between different individuals can be quite poor and may reflect professional biases. Whether the professional biases reported in this study can be generalized to other aspects of risk, however, remains open to question.
Ryan examined risk assessment associated with mental health problems in a questionnaire-based study that aimed to: 1. identify the principal factors underlying people's concepts of risk associated with this group of patients; 2. investigate whether demographic factors had any influence on rating behaviour; and 3. establish whether interprofessional biases emerged in the assessment of risk.5 Using factor analysis, Ryan identified six core risk factors, of which five related to risks faced by a patient (vulnerability, self-harm, dependency, underclass and medical disempowerment), whereas only one related to risks posed by them (threat). Importantly, Ryan found evidence for a strong gender effect in rating behaviour, with women perceiving greater risk under all six factors, regardless of their professional group. Finally, for four of the six factors (threat, vulnerability, medical disempowerment and underclass), psychiatrists perceived the risk to be lower than did other professional groups. The latter finding is clearly at odds with Montandon and Harding's finding of greater cautiousness among psychiatrists, especially since the 'threat' factor would, theoretically, have a conceptual overlap with 'dangerousness'.6 Thus, although these studies suggest that mental health professionals differ in their perception of risk, the data are inconclusive about the exact nature of such a difference. No studies, as far as we are aware, have investigated risk assessment of people with mental health problems in the field. This is probably because of the methodological and statistical difficulties which arise when certain factors cannot be directly controlled or manipulated by the investigators. For example, Montandon and Harding presented each of their participant raters with the same set of patient scenarios in which the presence and absence of certain fundamental characteristics (e.g. a violent tendency) were carefully controlled.6 This kind of approach facilitates comparison between different raters because each rater receives exactly the same information, and, moreover, key factors which may be pertinent to the level of risk ascribed to an individual can be easily manipulated by the investigators. Thus, the inter-rater reliability of any two assessors can be measured for specific categories of patient. By contrast, data collected from routine clinical practice would not be subject to rigorous control and it would be difficult to compare the reliability between different assessors without post-hoc procedures. However, although Montandon and Harding's approach is appealing on methodological grounds, there are several reasons why it may not reflect what happens in real working practice. First, assessors in controlled studies would have no prior knowledge of the patients described by the investigators. Thus, each case would be considered as though it were new, rather than allowing prior knowledge and experience to exert an influence upon the decision process. Second, the type of scenarios chosen for inclusion in laboratory studies may reflect characteristics that are of contemporary interest (e.g. violence, risk of suicide, etc.) rather than representing a true caseload.*
Risk assessment for people with mental health problems
Careful control of key characteristics may create stereotyped scenarios which, in turn, may exaggerate the level of inter-rater agreement above normal levels. Finally, it is arguable whether making an assessment in which no consequence can result from exercising poor clinical judgement is comparable to the kind of decision-making that occurs in real clinical settings, where accurate risk assessment can be a matter of life or death: for example, an assessor may be much more cautious if the assessment has potential ramifications for a client or the public. Although field investigations of risk assessment may present methodological challenges, it is timely that uncontrolled data be investigated and compared with data from laboratory studies. The current study explores the use of a methodology for assessing inter-rater reliability of risk assessment in the field. We would emphasize that this is a pilot investigation and that the feasibility of the methodology is of equal interest to the data generated. Using rating data derived from routine assessments, we aimed to explore: 1. differences in rating behaviour between the different professional groups; 2. the average level of agreement between randomly selected pairs of raters; and 3. whether the gender of the rater predicted greater or less caution in assessing risk (after Ryan5).
METHOD

MATERIALS

A risk assessment questionnaire was designed for the purpose of this study. The items in the questionnaire were chosen after studying a number of similar proformas used to assess risk in other units within our region. The majority of these addressed three general domains of risk: 1. risk of suicide; 2. risk of violence; and 3. vulnerability/risk of neglect; and, in general, assessments were made using ordinal scales measuring increasing levels of risk, and/or
unstructured free-text responses. The questionnaire used in this study was designed to capture the essence of other proformas rather than to be a model rating tool, since we are concerning ourselves here with the reliability of the risk assessment process rather than that of a specific assessment scale. In order to generate data that would be suitable for reliability analyses, we used ordinal scales for each risk element (e.g. no risk, minimal risk, moderate risk, high risk, maximum risk). However, for some particular items, we imposed percentages on the scale (e.g. 0%, 25%, 50%, 75%, 100%) in order to ensure that all raters employed similar anchor points for assessing the concept of likelihood.† Table 1 lists the questions. Each item was rated on a five-point ordinal scale. In order to discourage response bias, the order of these scales was alternated so that, for some items, it increased (e.g. 1 = very unlikely, 2 = unlikely, 3 = moderately likely, 4 = likely, 5 = very likely) while for others it decreased. However, when scoring items for analysis, 1 was always used to indicate low risk/low caution and 5 was always used to indicate high risk/high caution. In addition to completing the risk questionnaire, raters were also asked to complete a Clinical Global Impression (CGI) scale10 to assess the general severity of each patient's illness.
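As a minimal sketch of the scoring convention described above, re-coding items whose printed scale ran in the descending direction so that 1 always denotes low risk might look like the following (the function name and data layout are our own illustrative assumptions, not part of the original study):

```python
SCALE_MAX = 5  # all items used five-point ordinal scales


def recode(raw_score: int, item_reversed: bool) -> int:
    """Map a raw 1-5 rating onto the analysis scale, where 1 always
    means low risk/low caution and 5 high risk/high caution."""
    if not 1 <= raw_score <= SCALE_MAX:
        raise ValueError("rating must lie on the 1-5 scale")
    # Reversed items are mirrored about the scale midpoint: 5 -> 1, 4 -> 2, etc.
    return (SCALE_MAX + 1 - raw_score) if item_reversed else raw_score
```

So a rating of 5 on a descending item (where 5 was printed as 'very unlikely') is scored as 1 for analysis, while ratings on ascending items pass through unchanged.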
SUBJECTS

Patients

A consecutive series of patients was assessed using the proforma. The sample comprised 159 (44.4% male, 55.6% female) clients who fell under the care of one of the clinical teams within an NHS Trust and who were undergoing routine risk assessment under the local Care Programme Approach. All clients were aged between 16 and 65 (mean 37±13 years) and were being seen by a psychiatrist and at least one other Community Mental Health Team professional (i.e. these patients were all relatively complex cases: simple
Table 1 The risk assessment questionnaire

1. In your opinion, how likely is it that this client will present in A&E over the next 12 months having self-harmed?
2. In your opinion, how likely is it that this client will harm another person within the next year?
3. If this were to occur, how serious might such harm be?
4. How likely is it that this client will suffer at the hands of others within the next year?
5. How would you rate the risk of this client making a serious suicide attempt over the next year?
6. How well do you expect this client to comply with medication over the next year?
7. In your opinion, how likely is it that this client will be admitted/readmitted to hospital for psychiatric care over the next year?
8. Considering all possible unwanted outcomes, how far do you think the current care plan influences that overall risk?
9. Considering how this client has been over the past week, how would you feel about visiting them on your own in their home?
*Even if the investigators selected real-life case scenarios, the selection process itself might render the scenarios non-representative.
†Although, for analytical purposes, the percentages were still scored on a five-point ordinal scale.
outpatient cases with non-severe problems are not represented in this sample). The mean (±SD) CGI score was 3.9 (±1.17, min 1, max 7), which is indicative of moderate illness. The majority of patients (149/159) were white.

Raters

The raters represented the following six professional groups: 1. psychiatrists (P); 2. junior psychiatric doctors (JPD); 3. nurses (N); 4. community psychiatric nurses (CPN); 5. occupational therapists (OT); and 6. social workers (SW). In total, 633 ratings were made for the 159 patients (i.e. four ratings on average per team/patient; min 2, max 8). There was no standard distribution of professionals or genders at each patient assessment.‡ Representation of the different professions is displayed in Table 2. There was some variation between professional groups, both in terms of the number of assessments at which each was represented and in terms of the number of individuals representing each profession. This is inevitable given that this was an uncontrolled study. It is notable that psychiatrists were present at more assessments than any other professional group, and this is probably because the majority of cases were relatively complex and would necessarily require the involvement of a consultant.
PROCEDURE

Each patient was rated by a team of at least two people, and the composition of the assessment team was not manipulated in any way. Assessments were made with reference to both prior knowledge and case notes. At each assessment, relevant demographic data were recorded for the raters, who used identification codes to protect anonymity. All professionals were asked to complete the proforma without consulting with their colleagues and, for this reason, the assessments were completed at the start of each team meeting, before any group discussion. Data were collected over a total period of 18 weeks.
ANALYSES

The methodological problems associated with uncontrolled field studies pose particular difficulties for the way in which the data are analysed. In this study, it was possible to compare the ratings of any two individual raters within a single assessment because they would both have access to the same information about the patient. However, to compare the ratings of individuals from different teams would be less straightforward because the different patients being assessed by each team might vary markedly. To accommodate this problem, we devised a method of representing each rater's score as a proportion of the average team score. This is illustrated in Table 3 for a hypothetical team of five raters. The score for item 1 (from our risk assessment questionnaire) is presented in column 3 for each member of the assessment team. The average score for the whole team is then calculated (indicated in column 4) and each individual's score is then represented as a proportion of the group average (in column 5, by dividing the individual score by the group mean score). This is a form of score standardization, somewhat akin to a z-score, that allows comparisons to be drawn between ratings for different patients, regardless of individual differences between patients. A z-score would not be appropriate here because the number of assessors within each team would be insufficient to generate a normal distribution, and hence a standard deviation, of scores.* In the example of Table 3, the psychiatrist gave a rating which was one and a half times the group average, whereas the social worker's score was only half the group average. On their own, these weighted (or relative) scores are not particularly informative. However, when the relative scores for each professional group are averaged across every patient assessment, the mean score will indicate whether particular professional groups tend to rate more or less cautiously than any others. If all groups tend to rate similarly to each other, the average relative score for each group will approximate to 1. All weighted item scores for each professional group were entered in an analysis of variance (ANOVA) model.*
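The weighted-score calculation just described can be sketched in a few lines (an illustrative sketch; the function name and list-based data layout are our assumptions, not taken from the study):

```python
def weighted_scores(team_scores):
    """Express each rater's item score as a proportion of the team mean,
    so that a score of 3 against a team mean of 2 becomes 1.5 and a score
    of 1 against the same mean becomes 0.5."""
    mean = sum(team_scores) / len(team_scores)
    return [round(score / mean, 3) for score in team_scores]


# The hypothetical five-rater team of Table 3 (item 1 scores 2, 3, 2, 1, 2):
# weighted_scores([2, 3, 2, 1, 2]) -> [1.0, 1.5, 1.0, 0.5, 1.0]
```

Averaging such relative scores for one professional group across many assessments then indicates whether that group rates systematically above or below its teams' averages.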
Table 2 Representation of the different professions in the assessment teams

Profession (no. of individuals)    No. of assessments including at least 1 member of this professional group (max 159)
Psychiatrist (6)                   134
Junior psychiatric doctor (10)     98
Nurse (15)                         97
‡This was a study of risk assessment in the field and so the rating groups were not manipulated in any way.
*A z-score also takes into account the standard deviation of the data, which makes it possible to compare measures from completely different scales. Our measure, by contrast, does not permit this kind of comparison. One potential problem arising here is that the relationship between an individual's score and the mean team score is not linear. For example, if the mean team score is 2 and a particular assessor's score is 3, the individual score is 1.5 times as large as the group mean. However, if the mean team score is 4 and a particular individual scores 5, his/her score is only 1.25 times as large as the group mean, even though the absolute difference is 1 in both instances. With regard to comparisons between professional groups and gender, this would only be problematic if there was a considerable variation in the average item score for any one profession. However, two one-factor ANOVAs comparing levels of (a) professional group and (b) gender showed no overall differences in item scores (F<1 for both).
We also investigated whether a gender bias was present for any of the item scores. Again, in order to overcome the problem of individual patient variability, we utilized weighted item scores for men and women. Finally, we randomly selected pairs of ratings (each pair was taken from within the same rating team, irrespective of professional group) and tested inter-rater reliability for the total scores (i.e. the sum of scores across all nine items on the scale). For this analysis, it was not necessary to use weighted (relative) scores because the comparisons were within each rating group rather than between. The reason for using the summed scores is that, although it is contentious whether several ordinal scales can be added together to provide useful summary data, this practice is nonetheless common to many psychiatric rating scales, and it would therefore be informative to explore whether inter-rater agreement is similar to that observed for other scales.
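The pair-sampling procedure described above might be sketched as follows (a hedged illustration; the function name and the nested-list representation of a team's ratings are our own assumptions):

```python
import random


def sample_pair_totals(team_item_scores, rng=None):
    """From one assessment team, draw two distinct raters at random,
    sum each rater's nine item scores, and randomly assign the two
    totals to the 'rater 1' and 'rater 2' categories."""
    rng = rng or random.Random()
    rater_a, rater_b = rng.sample(list(team_item_scores), 2)
    totals = [sum(rater_a), sum(rater_b)]
    rng.shuffle(totals)  # random assignment to rater 1 / rater 2
    return totals[0], totals[1]
```

Repeating this for every assessment team yields one (rater 1, rater 2) total per patient, which is the paired data used for the reliability analysis.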
RESULTS

PROFESSIONAL BIAS IN THE ASSESSMENT OF RISK

The six professional groups were compared on each risk item and the data were entered into a one-factor ANOVA model
(one model per item, nine in total). Table 4 displays the general relationship to the assessment team average for each professional group, for each risk item. A score of 1 indicates that, on average, a profession's ratings are typical of those of the assessment team as a whole, whereas a score considerably greater or less than 1 indicates that a profession's ratings tend to be extreme relative to other members of their assessment team. So, for example, an average relative score of 1.2 would indicate that a particular profession's ratings were 1.2 times higher (i.e. 20% higher) than the average rating. There was an overall significant difference between the professional groups for all items, except item 7 (F>2.5, P<0.02 for all items except item 7). The most extreme weighted (relative) score for each item is marked with an asterisk for clarity. It is evident that no single professional group consistently generated the most extreme ratings relative to the other members of the assessment group. Indeed, every profession's ratings were extreme for at least one item. This pattern is confirmed by examining the overall mean weighted scores in the last row of Table 4, which are broadly similar for each professional group. There appears to be no common factor which would associate extreme rating with any one group. For example, nurses were the most extreme raters on items 1 and 8, where they generated ratings that were, on average, 21% and 35% higher than the overall group mean respectively.

Table 3 Calculation of weighted item scores for each rater

Rater  Profession              Item 1 score  Average score for item 1  Weighted item score
1      Nurse                   2             2                         1
2      Psychiatrist            3             2                         1.5
3      Occupational therapist  2             2                         1
4      Social worker           1             2                         0.5
5      Social worker           2             2                         1

Table 4 The relationship of professional group ratings to the assessment group average

Item   P       JPD     N       CPN     OT      SW
1      0.87    0.957   1.206*  1.099   0.845   1.103
2      0.982   0.986   1.068   1.086*  0.932   1.058
3      1.057   1.027   0.91    0.965   1.186*  1.058
4      1.002   1.121*  1.048   0.927   0.918   1.019
5      0.824*  0.987   1.179   1.034   0.885   1.151
6      1.079   1.042   1.035   0.991   1.129*  1.049
7      1.021   1.018   1.164   1.057   1.129   1.173*
8      1.018   0.907   1.351*  0.899   1.102   0.837
9      1.243*  1.091   0.947   0.872   1.063   0.836
Mean   0.989   1.009   1.075   0.945   1.03    0.921
*Although, strictly speaking, ANOVA should not be used for ordinal data, the weighted item scores approximated to interval data and, moreover, there were at least 30 weighted scores for each group, which provides a sample of sufficient size to use this test when assumptions of normality are violated.11
However, these items are not similar and do not appear to relate to a more global aspect of risk (e.g. risk of harm to self, risk of harm to others, etc.). A similar situation is evident for occupational therapists regarding items 3 and 6.† Thus, although there appears to be considerable interprofessional variability across risk items, this is not easily accounted for in terms of attitudes toward particular aspects of risk. Given that, by definition, there must be one group who will be extreme on any one item, along with the fact that no single group was extreme on more than two items, it would seem that Table 4 reflects random scatter rather than any systematic bias on particular items.
GENDER BIAS IN ASSESSMENT OF RISK

On average, women assessed risk to be higher than did men for 4/9 (44.5%) items. For 4/9 items there was no gender difference and, for one item (11%), men perceived risk to be higher than did women (Table 5). Interestingly, there was also a gender difference in CGI scores, with female raters tending to classify patients as being more ill than did male raters (P<0.01). This may have some bearing on the female tendency to ascribe greater risk/caution.
RELIABILITY OF RATINGS BETWEEN RANDOMLY SELECTED PAIRS OF RATERS

Pairs of raters were randomly selected from each of the 159 assessment teams, irrespective of their identity, gender or profession. The scores across all nine items were added together and these totals were randomly assigned to the categories 'rater 1' and 'rater 2'. Thus each category contained one randomly selected rating for every patient assessed during the study. The mean total score for 'rater 1' was 19.9±5.1 (range 9–34) and for 'rater 2' was 20.2±5.2 (range 9–37). Pairs of total rating scores are illustrated in Figure 1. There was a significant correlation between raters (r=0.645, P<0.0001), although this actually accounts for only approximately 42% of the variance in scores. Figure 2 shows the distribution of difference scores between pairs of raters assessing the same patients. As would be expected, this distribution is approximately normal (mean=−0.32, median=0, SD=4.3). It is of some concern to note that the maximum discrepancies between raters were as large as 12 points, with 4–5 point discrepancies being not uncommon. Although there was a moderately good correlation between total scores, this does not necessarily indicate that raters agreed about most issues. For example, it is possible that two raters might ascribe very different scores to the same items yet still generate the same total score. For this reason, we calculated the average discrepancy between the pairs of raters, for each of the nine items (Table 6). The median discrepancy did not exceed one point for any item, although there was some variability between items in terms of both the mean and the dispersion of discrepancy scores. Given that the maximum possible discrepancy on any one item was four points, the overall mean difference (averaged across all items) is approximately 19% of the possible maximum. It is worthy of note that every item produced at least one discrepancy of three or four points (i.e. 75% or 100% of the maximum discrepancy between raters), demonstrating that, on rare occasions, the level of agreement for single items was very poor indeed.
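The link between the reported correlation and the variance it accounts for is simply the square of Pearson's r (the coefficient of determination); a one-line illustration:

```python
def variance_explained(r: float) -> float:
    """Proportion of variance in one rater's totals accounted for by the
    other rater's totals: the square of the Pearson correlation."""
    return r * r


# For the correlation reported above, r = 0.645:
# variance_explained(0.645) -> 0.416..., i.e. roughly 42% shared variance
```

This is why a correlation that looks "moderately good" at r = 0.645 still leaves well over half the variance in one rater's scores unexplained by the other's.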
DISCUSSION

In this paper we have investigated inter-rater reliability, inter-professional variability and gender-based bias in risk assessment. In contrast to other work in this area, our data were collected during working practice, rather than under
Table 5 Gender differences with respect to individual items

Item  Topic                                                       Higher weighted score  P-value
1     Likelihood that client presents at A&E after self-harming   Women                  <0.0001
2     Likelihood client harms another person                      Women                  <0.03
3     Seriousness of harm to another person                       Neither                <0.12
4     Likelihood of client suffering at hands of others           Neither                <0.28
5     Likelihood of suicide attempt                               Women                  <0.0001
6     Likelihood of client complying with medication              Neither                <0.33
7     Likelihood of client being readmitted for psychiatric care  Women                  <0.002
8     Effect of current care plan on risk                         Neither                <0.06
9     Perceived risk associated with a home visit                 Men                    <0.001

†Although given the small number of OTs who participated in this study, the effect of particular individuals may be particularly strong.
Figure 1 Inter-rater reliability for summed score across all nine risk items
Figure 2 The distribution of differences between Rater 1 and Rater 2 totals
controlled conditions. Clearly, this approach has some methodological disadvantages, but these might be offset by the additional validity gained through using real-life data. We would not view this work as being anything beyond the level of a pilot study: indeed, as we have already indicated, our objective was as much to test whether it is possible to extract meaningful reliability data from uncontrolled risk assessments as to test reliability levels per se. So, before we discuss the implications of our findings, we will first turn to some of the methodological issues associated with our approach and consider whether uncontrolled studies of risk assessment are really viable. One obvious disadvantage with this kind of data, particularly with respect to inter-professional comparisons, is the variability in representation of the different professional groups. Not unrelated to this is the variability in the influence of particular individuals, particularly in those groups which have only a few representatives (e.g. occupational therapists). This would certainly be a valid criticism of our current study, but not of the approach in general. For example, if the study were repeated using data from ten or more psychiatric units, the issue of small numbers would not be a problem, even for those professional groups who were only involved in a minority of risk assessments. In comparison with controlled studies, then, one might just argue that a much larger sample size is needed, but that this disadvantage can be offset by collecting data as part of routine practice, which would
Table 6 Discrepancies between raters for individual items

Item   Mean difference (min 1, max 5)   Median difference   SD difference
1      0.94                             1                   0.89
2      0.52                             0                   0.74
3      0.66                             1                   0.76
4      0.87                             1                   0.95
5      0.84                             1                   0.81
6      0.66                             1                   0.69
7      0.68                             1                   0.71
8      0.79                             1                   0.83
9      0.80                             1                   0.98
Mean   0.75                             0.89                0.82
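The per-item discrepancy statistics reported in Table 6 are straightforward to reproduce from paired ratings. The sketch below shows the calculation for one item; the rating values are purely illustrative and are not the study's actual data.

```python
import statistics

# Hypothetical paired ratings for a single risk item (Rater 1 vs Rater 2)
# across several assessments; illustrative values only.
rater1 = [3, 2, 4, 1, 5, 3, 2, 4]
rater2 = [2, 2, 5, 1, 3, 3, 1, 4]

# Absolute difference between the two raters for each assessment
diffs = [abs(a - b) for a, b in zip(rater1, rater2)]

mean_diff = statistics.mean(diffs)      # cf. the 'Mean difference' column
median_diff = statistics.median(diffs)  # cf. the 'Median difference' column
sd_diff = statistics.stdev(diffs)       # cf. the 'SD difference' column

print(mean_diff, median_diff, sd_diff)
```

Averaging these three statistics over all nine items then yields the bottom ('Mean') row of the table.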
both capitalize on existing resources and be representative of real-life risk assessment processes. Thus, a larger scale follow-up study would not necessarily be expensive (cf. a laboratory-based study).

The statistical issues associated with this kind of data are also worthy of discussion. In this study, we abstracted weighted (or relative) scores to describe the extremes of ratings for a particular group. This was an attempt to synthesize data from a number of different raters and patients to produce a measure indicative of more general aspects of rating behaviour. This is a novel approach, but one which may have some merit for exploring uncontrolled data of this nature. Ideally, one would utilise z-scores as a standard measure of group position but, in working practice, this would not be possible unless the number of raters assessing a single case were large enough to generate a normal distribution of scores (our estimate would be a minimum of 10 but, optimally, one might need at least 25). Our measures, by contrast, can be applied when the number of raters is small. However, we would stress that these weighted scores only become informative when they are averaged across a large number of different assessments.

Of course, the use of non-ratio data to generate proportion scores is open to some criticism: in our questionnaire we imposed ratio-type scales (i.e. percentages) for some items, but for most items we used numerical scales to represent increasing levels of concern/caution. This taps into the much broader issue of how rating scale data, in general, should be analysed. Many psychiatric rating scales, for example the Yale-Brown Obsessive Compulsive Scale,12 the Montgomery-Åsberg Depression Rating Scale13 and the Beck Depression Inventory,14 comprise individual items which are scored using ordinal scales.
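The weighted (relative) scores described above can be sketched in code. The exact formula is not given here, so this sketch assumes one plausible form: each rater's rating expressed as a proportion of the mean rating given by all raters who assessed the same case. The professions and rating values are illustrative only.

```python
from statistics import mean

# Hypothetical data: each assessment is rated by a small group of raters.
# The study's actual weighting formula is not reproduced here; this sketch
# assumes a simple proportion score relative to the assessment-group mean.
assessments = [
    {"psychiatrist": 4, "nurse": 3, "social_worker": 2},
    {"psychiatrist": 3, "nurse": 3, "social_worker": 4},
    {"psychiatrist": 5, "nurse": 2, "social_worker": 3},
]

relative = {}  # profession -> list of proportion scores, one per case
for case in assessments:
    case_mean = mean(case.values())
    for profession, rating in case.items():
        relative.setdefault(profession, []).append(rating / case_mean)

# Averaged across many assessments, a value above 1 suggests a group tends
# to rate higher than its co-raters; below 1, lower. Unlike z-scores, this
# can be computed even when only two or three raters assess a case.
for profession, scores in relative.items():
    print(profession, round(mean(scores), 2))
```

The key design point, as noted above, is that such proportion scores remain usable with very small rater groups, at the cost of only becoming interpretable once averaged over many assessments.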
It is rare to see studies which present data on individual items from these scales, but it is notable that the vast majority of studies using these scales do assume parametric qualities for the summative scores across all items. Whether it is reasonable to assume that several ordinal scales can be added together to generate
ratio data is beyond the scope of this article, but we would point out that statistical violations may be inherent in any study which uses numerical scales to rate behavioural qualities.

Turning now to the results of this study, we found little evidence for consistent professional bias in risk assessment. Although each professional group produced extreme ratings on at least one item, there was no one group which always perceived risk to be particularly high or low. Previous work in this area has largely focused on psychiatrists, and there have been conflicting reports about whether this professional group tends to be more or less cautious than others.5,6 In this study, we found that psychiatrists tended to give ratings that were average for their assessment group, and we found little evidence that they were routinely cautious (or incautious, for that matter). This may be because we included questions covering a broader range of risk aspects than other studies. For example, Montandon and Harding6 only considered the ‘dangerousness’ of the patient. Our finding that psychiatrists were more cautious about home visits may be consistent with these authors’ data, but there was no evidence that this could be generalized to other aspects of risk. Had we included fewer items in our questionnaire, any inter-professional differences we reported might have appeared to carry more weight. However, this might have led to premature conclusions, since in considering the ‘extremity’ of any group for a specific component of risk we must remember that no group was ‘extreme’ on more than two out of nine items.
Our finding that women tended to be more cautious in assessing risk accords with the work of Ryan.5 However, once again, this trend was not apparent for all aspects of risk, although it is notable that three of the four items which elicited significantly greater caution in female raters related to client well-being rather than potential harm to others (by the client).

At this point it is worth making some comment on the specific questionnaire designed for this study. As we have already discussed, our aim was to test a general principle rather than to validate a specific rating tool: the fact that no standard tool is routinely used for making this kind of assessment led us to design a proforma which reflected the kind of questionnaires used in other units. Thus, we would argue that our data should be generalizable to the kinds of risk assessment proforma that are more widely used.* Whether the scale we used is predictive of actual outcomes is another issue and one which would need to be followed up in a much larger, longitudinal study.

We investigated the reliability between randomly selected pairs of raters, using the total score across all items and the individual item scores. Reliability for the former was moderate, but not as great as would be expected for other psychiatric scales and, for useful comparative purposes, certainly below the level of agreement that would be acceptable in clinical trials.* Agreement between raters on individual items was not particularly high either, and there were some instances when disagreement was maximal. Given that reliability is a pre-condition for validity, this raises doubts about how easy it will be to design valid risk assessment tools. In this context, it is of concern that the Mental Health Policy Implementation Plan15 prescribes risk assessment proformas to be drawn up on a local basis by every team; a great deal of effort may be expended in designing hundreds of different tools, few of which will be reliable, let alone valid. Although there is little empirical evidence to suggest that a global risk may be ‘extracted’ by adding together several measures of qualitatively different risk aspects, there is no empirical evidence, either, in favour of an alternative method. Perhaps this indicates the lack of clear scientific principles underlying risk assessment in general: indeed, for such a timely and important issue, there is a surprising lack of good evidence to back current practice.

*Indeed, our proforma was designed after looking at a range of questionnaires used by other units in our geographical region.
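The inter-rater reliability figure discussed above is a product-moment correlation between paired raters' summed scores, which can be computed directly from the paired totals. The sketch below uses illustrative totals only, not the study's raw data.

```python
from math import sqrt

# Hypothetical summed scores (across all nine risk items) for randomly
# selected pairs of raters assessing the same patients; illustrative only.
rater1_totals = [18, 22, 15, 30, 12, 25, 20, 17]
rater2_totals = [20, 19, 16, 27, 15, 28, 18, 14]

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson_r(rater1_totals, rater2_totals)
print(round(r, 3))
```

Against the benchmarks noted in the footnote to the Conclusion (0.6 or less unacceptably low; 0.8–0.9 routine for depression rating scales), a value such as the study's 0.645 sits only just above the minimally acceptable level.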
CONCLUSION

In conclusion, then, our data do suggest that different individuals tend to perceive risk differently, although this does not fundamentally relate to professional background. Gender appears to be more of an issue and (especially since this finding replicates that of previous work) perhaps this should be taken into account in working practice. At any rate, we would suggest that risk assessment is something which can only benefit from adopting a multi-disciplinary approach, rather than giving responsibility to one individual. As far as this particular methodological approach is concerned, we would suggest that a larger study involving many different centres would be both viable and informative.

*A correlation value of 0.6 or less would be unacceptably low, so our value of 0.645 indicates that agreement barely exceeds minimally acceptable levels: by way of comparison, it may be useful to note that depression rating scales routinely achieve inter-rater reliability values of 0.8 or more and, not infrequently, 0.9.
ACKNOWLEDGEMENTS

The authors would like to thank the Association of Directors of Social Services (ADSS) for supporting this work by way of a grant awarded to the second author.
KEY POINTS

. This paper presents a novel approach to investigating reliability of risk assessment for mentally ill patients
. Although we found considerable variation between individual raters, there was little evidence to suggest that specific professional groups were particularly cautious or incautious when assessing risk
. Female raters showed some tendency towards greater caution when assessing risk
. Inter-rater reliability only reached moderate levels
. These findings have implications for investigating risk assessment in naturalistic settings and may also inform the design of future risk assessment tools
REFERENCES

1. Last JM (1998) Dictionary of Epidemiology. Oxford University Press.
2. Department of Health (1994) Guidance on the discharge of mentally disordered people and their continuing care in the community (White Paper). London: DOH.
3. Department of Health (1995) Building Bridges: A guide to arrangements for inter-agency working for the care and protection of severely mentally ill people (White Paper). London: DOH.
4. Mental Health (Patients in the Community) Act 1995. London: HMSO.
5. Ryan T (1998) Perceived risks associated with mental illness: Beyond homicide and suicide. Soc Sci Med 46: 287–97.
6. Montandon C, Harding T (1984) The reliability of dangerousness assessments: A decision making exercise. Br J Psychiatry 144: 149–55.
7. Frederick CJ (1978) Dangerous Behaviour: A problem in law and mental health. Washington DC: US Department of Health, Education and Welfare.
8. Pfohl P (1978) Predicting Dangerousness. Lexington, MA: Lexington Books.
9. Wenk E, Robinson J, Smith G (1972) Can violence be predicted? Crime Delinquency 18: 393–402.
10. Guy W (1976) ECDEU Assessment Manual for Psychopharmacology (US Dept of Health, Education, and Welfare publication ADM 76-338), pp. 217–22. Rockville, MD: National Institute of Mental Health.
11. Pagano R (1990) Understanding Statistics in the Behavioural Sciences (3rd edn). St Paul, MN: West Publishing Co.
12. Goodman WK, Price LH, Rasmussen SA et al (1989) The Yale-Brown obsessive-compulsive scale: Development, use and reliability. Arch Gen Psychiatry 46: 1006–11.
13. Montgomery SA, Åsberg MA (1979) A new depression scale designed to be sensitive to change. Br J Psychiatry 134: 382–9.
14. Beck AT, Ward CH, Mendelson M et al (1961) An inventory for measuring depression. Arch Gen Psychiatry 4: 53–63.
15. Department of Health (2000) Mental Health Policy Implementation Guide. London: DOH.