Psychological Review 1991, Vol. 98, No. 2, 155-163

Copyright 1991 by the American Psychological Association, Inc. 0033-295X/91/$3.00

A General Model of Consensus and Accuracy in Interpersonal Perception

David A. Kenny
University of Connecticut

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Consensus refers to the extent to which 2 judges agree in their ratings of a common target. A general model of interpersonal perception based on Anderson's (1981) weighted-average model is developed. The model shows that increased acquaintance does not always lead to large changes in consensus. Degree of overlap between the target behaviors observed by the judges and similarity of meaning systems are key but neglected parameters. The model can also be used as a basis for determining the accuracy of person perception. In some cases, accuracy can increase with greater acquaintance, whereas consensus may not.

A fundamental topic in psychology is the extent to which two observers or judges agree with one another in their impressions of a common target. Imagine two people, Mary and Susan, who are each asked to judge on a 9-point scale how friendly Helen is. At issue is the extent to which Mary's rating of Helen agrees with Susan's rating of Helen. This question is often referred to as agreement, but because people can agree about nonsocial objects, the term consensus is preferable. Consensus has a different but related meaning in attribution theory that is not intended in this article.

Quite clearly, the consensus question is closely related to the accuracy question. It should, however, be realized that consensus does not imply accuracy. A father and a mother can agree that their newborn child will win a Nobel Prize, but the child is not likely to win that prize. Generally, however, accuracy implies consensus. At the limit, this must be the case. That is, if two people are both exactly accurate, they must be in consensus. In social perception, exact accuracy is rare, and only partial accuracy is the norm (Kenny & Albright, 1987). It is both theoretically (Hastie & Rasinski, 1988) and empirically (Kenrick & Stringfield, 1980) possible for two judges not to agree, with both being partially accurate. If there are independent sources of variance to which the two judges have differential access, they can disagree, yet both can be partially accurate. So technically, consensus is neither a necessary nor sufficient condition for accuracy. However, the consensus and accuracy questions are linked. In this article, the primary focus is consensus, but accuracy is considered in a later section.

Some presentations of consensus mix self-peer (agreement between a person's view of a target and that target's view of himself or herself) with peer-peer (agreement between two people about a third person) agreement studies. Although there are

probably great similarities in the two sets of relationships, there are theoretical and empirical reasons (John & Rothbart, 1988) to expect differences. Thus, in this article, only peer-peer agreement is considered.

Historical Review

Historically, there are three interwoven traditions that have studied consensus. They are research in (a) person perception, (b) personality, and (c) observer ratings. These three areas are briefly reviewed.

Person Perception

Within the field of social psychology, there has been considerable interest in the extent to which person perception is driven by the stimulus or by internal processes. This question has been described in various ways. For instance, researchers following Dornbusch, Hastorf, Richardson, Muzzy, and Vreeland (1965) have referred to the eye of the beholder. Higgins and Bargh (1987) discussed the issue of whether social perception is data-driven versus theory-driven. Research in the area of consensus has been very important in this debate.

The most influential study in this area is a study by Dornbusch et al. (1965). They studied an unspecified number of 9- to 11-year-old summer-camp residents. The children had known each other for only 2 to 3 weeks. Each child was asked to "Tell me about ?" These free descriptions were coded into 69 categories. Dornbusch et al. (1965) concluded that "the most powerful influence on interpersonal description is the manner in which the perceiver structures his interpersonal world" (p. 440).

The Dornbusch et al. (1965) study was followed up by Bourne (1977) and Park (1986). Bourne, using only 17 subjects, replicated the Dornbusch et al. study. Instead of using relatively unacquainted children, he used well-acquainted adults, and he used rating scales as well as free descriptions. Like Dornbusch et al., Bourne found low levels of consensus. Park (1986) examined consensus in a more controlled environment than previous studies and found higher levels of consensus than the earlier studies. Work by Touhey (1972) and Rozelle and Baxter

This research was supported in part by grants from the National Institute of Mental Health (RO1-4029501) and the National Science Foundation (BNS-8807462). I thank Bernadette Park, Charles Judd, Deborah Kashy, Bryan Hallmark, and Robert McCrae, who provided comments on a draft version of this article. Correspondence concerning this article should be addressed to David A. Kenny, Department of Psychology, University of Connecticut, Storrs, Connecticut 06269-1020.


(1981) showed that experimental instructions to increase the subject's motive to be accurate also increased consensus.


Personality

Personality psychologists have relied primarily on self-report inventories to measure individual differences. However, as reviewed by Wiggins (1973), some researchers (e.g., Cattell) have supplemented self-report measures with peer ratings, and others (e.g., Norman) have primarily relied on peer-rating inventories. Over the last decade or so, there has been a virtual explosion of personality peer-rating studies (Kenrick & Funder, 1988). To a large extent, these studies have been undertaken in response to the attack on personality. For instance, Bourne (1977) claimed to show that judges were not in consensus and so personality did not exist. Subsequently, numerous personality researchers (Funder, 1987; Kenrick & Funder, 1988; Kenrick & Stringfield, 1980; Moskowitz & Schwarz, 1982) have argued that consensus in personality judgments demonstrates an empirical basis for individual differences.

Observer Ratings

Psychologists have long used observers to rate or code social behavior. As biologists use mass spectrometers and chemists use electron microscopes, the most valued "instrument" used by psychologists is the human observer. Initially, classical test theory was applied to human observers who were treated as if they were items on a test. It became clear that human observers were subject to leniency and halo biases that were less problematic with nonhuman methods. Models of the rating process were proposed by Guilford (1954) and Stanley (1961) and were in large part subsumed by Cronbach, Gleser, Nanda, and Rajaratnam's (1972) generalizability theory.

A second tradition, besides classical test theory, is the multitrait-multimethod matrix (Campbell & Fiske, 1959). The rater is treated as a method of measurement in the multitrait-multimethod matrix. Method variance is interpreted as halo effect within the multitrait-multimethod matrix. Consensus between a pair of raters is called convergent validation within this tradition.
Kane and Lawler (1978) have studied the extent to which peer ratings and rankings can be used to measure psychological constructs. Researchers in areas other than personality have used peer ratings as a basic measure. Developmental psychologists interested in popularity, social skills, and social withdrawal have used peer ratings. Also, in industrial psychology, performance appraisal often involves having more than one person evaluate a given target.

The Weighted-Average Model

In this section, a general mathematical formulation of the various factors that determine the level of consensus is presented. The model can be used to model the level of consensus as well as accuracy in person perception. The major focus in this article is on consensus.

Model Parameters

Before presenting the formal model, the six factors that determine consensus are defined.

Acquaintance. Acquaintance is simply taken to mean the sheer amount of information to which the judge is exposed. Presumably, the more of a target's behaviors two judges see, the more they will agree.

Overlap. To what extent do two judges observe the target at the same time? That is, to what extent do the judges observe the same set of target behaviors? The overlap factor, which has been largely ignored, plays a pivotal role in determining the degree of consensus.

Shared meaning systems. To what extent is an act given the same meaning by two judges? That is, if two judges see a target engage in a behavior, to what extent do they label that behavior in the same way?

Consistency. How consistent is the target's behavior? If the target is friendly in one situation, will the target be friendly in another situation? Historically, personality researchers have taken consensus to mean that the target's behavior is consistent, but consensus can exist even when the target's behaviors are not consistent.

Extraneous information. To what extent does the judge rate the target on the basis of extraneous information, that is, information not based on the target's acts?

Communication. To what extent do the judges share with each other their impressions of the target? Two judges can be in consensus because they communicate their impressions to each other.

It is possible to relate all six of these factors in a single mathematical model. A modified version of Anderson's (1981) weighted-average model is used. Before beginning, a brief review of the weighted-average model is presented. Imagine that a judge knows three facts about a target: She is a librarian, she is politically conservative, and she likes to dance. The judge is asked to rate how extraverted the target is. According to the weighted-average model, each fact has a scale value.
The scale value, symbolized by s, states the impression that the judge would have of the target if there were no other information. Presumably for extraversion, being a librarian would have a negative scale value, whereas liking to dance would have a positive scale value. Also associated with each piece of information is a weight, which is usually symbolized by w. The weights multiply the scale values and represent the importance or salience of the information. The weighted-average model states that the impression that a judge has of the target equals the sum of each weight multiplied by the scale value, and this sum is divided by the sum of the weights.

This weighted-average model is applied to the measurement of consensus. In the model there is a target who engages in a series of behavioral acts. The definition of an act is left somewhat vague but includes verbal acts, nonverbal behavior, and physical appearance information. The acts are designated as As. There is a judge who observes a subset of the As and is asked to judge the target on a given trait. Each act is then given a meaning or, in terms of Anderson's (1981) model, a scale value. Two judges may attach different scale values to the same act.


For instance, they could observe the same act (e.g., Dave losing his keys), and one might infer that the act was dispositionally caused (Dave is forgetful), whereas the other might infer that the act was situationally caused (the keys fell out of Dave's pocket).

The judge's impression of the target is also influenced by the unique impression, which represents that part of the judge's impression not caused by the target's acts. For instance, the judge may be favorably disposed to the target because the judge is in a good mood that day or because the judge believes that all targets tend to have a high standing on the trait. The unique impression in this model is a broader concept than the initial impression in Anderson's (1981) formulation. In this model, the unique impression represents all the information that a judge uses that is not based on the target's behavior. In Anderson's model, the initial impression represents the impression that a judge has of a target based on no information. The unique impression can change over time, whereas the initial impression cannot change.

The judge's impression of the target is assumed to be a weighted average of the scale values for the behaviors that the judge observes plus the unique impression. It is assumed that each of the acts is equally weighted. The equal-weighting assumption is made to simplify an already complex model. Ideally, later work with the model will relax the equal-weights assumption.

Figure 1 represents the model. There is 1 target who engages in four acts. Of course, real targets engage in many more acts in a few minutes. For illustration purposes, a small number of acts was chosen. The acts are labeled A1 through A4. Two judges observed a subset of the target's acts. The first judge views acts A1 and A2. The second judge observes A2 and A3. Act A4 is observed by neither judge. Each judge rates the target. To do this, a meaning or a scale value s is attached to each act observed by each judge.
Note that although both judges observe A2, they each attach a different meaning to that act. Also, each judge forms a unique impression of the target, denoted by s_i0 for judge i. The scale values for the acts observed by the judge and the unique impression combine to form that judge's impression, I. The scale values for each act are weighted equally. The judges then communicate to each other their impressions, and they mutually influence one another. This mutual influence is represented by the paths labeled a. The value of a can be negative; for instance, if the judges dislike each other, they might negatively influence one another. However, in most applications, the value of a is positive.

The correlation between two judges' impressions across a set of targets is a function of the following six parameters: First is n, the number of acts that each judge observes, which is assumed to be the same for each judge. Second is q, the proportion of acts for the two judges that overlap. So if n is 40 and q is .75, then the two judges see nq, or 30, of the same acts. In Figure 1, q is .5 because each judge sees two acts, one of which is the same. Third is ρ1, the degree to which a target's behavior is consistent. It measures the extent to which the same judge gives similar scale values to two different acts. Fourth is ρ2, the correlation between two judges' scale values for the same act. This parameter measures the similarity between the two judges' meaning systems. In Figure 1, it is the correlation between s12 and s22. Fifth is k, the weight for the unique impression. Finally, sixth is


a, the degree to which the judges influence one another. To fix the metric of the scale values and the unique impression, their variances are set to 1, as shown in the following weighted-average equation:

$$I_1 = \frac{k s_{10} + \sum_{j} s_{1j}}{k + n} + a I_2 \tag{1}$$
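A minimal Python sketch of Equation 1 may help make the weighted average concrete. It assumes, as the text states, that each act has a weight of 1 and the unique impression has a weight of k. The function names `impression` and `communicate` and all numeric scale values are hypothetical and chosen only for illustration; the layout follows Figure 1 (Judge 1 sees A1 and A2, Judge 2 sees A2 and A3). Because each judge's impression depends on the other's, the sketch solves the two equations simultaneously, which assumes that a squared is not equal to 1.

```python
def impression(unique, scale_values, k):
    """Weighted average: the unique impression has weight k, each act weight 1."""
    n = len(scale_values)
    return (k * unique + sum(scale_values)) / (k + n)


def communicate(x1, x2, a):
    """Solve I1 = x1 + a*I2 and I2 = x2 + a*I1 together (assumes a*a != 1)."""
    denom = 1.0 - a * a
    return (x1 + a * x2) / denom, (x2 + a * x1) / denom


# Hypothetical scale values (illustration only, not data from the article).
x1 = impression(unique=0.5, scale_values=[1.0, -0.5], k=1.0)   # Judge 1: A1, A2
x2 = impression(unique=-1.0, scale_values=[0.5, 2.0], k=1.0)   # Judge 2: A2, A3

# Impressions after mutual influence with communication parameter a = 0.2.
i1, i2 = communicate(x1, x2, a=0.2)
print(round(x1, 3), round(x2, 3), round(i1, 3), round(i2, 3))
```

With a = 0 the function `communicate` returns the two weighted averages unchanged, which corresponds to the no-communication case.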

What is interesting is that with the exception of a, all of the parameters can be put into a single equation that states the degree of consensus, which is denoted r. Imagine a set of targets, each of whom is judged by the same pair of judges. Judges and targets are randomly chosen. The statistic r is defined as the correlation between the two judges' ratings of a common set of the targets. Equation 2 is the equation for the consensus correlation, or r, with no communication effect (a = 0):

$$r = \frac{q n \rho_2 (1 - \rho_1) + n^2 \rho_1 \rho_2}{k^2 + n + n(n - 1)\rho_1} \tag{2}$$
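The consensus correlation can be sketched in a few lines of Python. The sketch below writes the formula out from the model's stated assumptions (unit variances, equal act weights, independent unique impressions, and a correlation of ρ1 × ρ2 between two judges' scale values for different acts): the numerator is the covariance between the two judges' weighted sums and the denominator is the variance of one judge's weighted sum. The function name `consensus` and the argument names are hypothetical.

```python
def consensus(n, q, rho1, rho2, k):
    """Consensus correlation r with no communication (a = 0).

    n    -- number of acts each judge observes
    q    -- proportion of those acts seen by both judges
    rho1 -- consistency of the target's behavior
    rho2 -- similarity of the judges' meaning systems
    k    -- weight of the unique impression
    """
    # n*q shared acts correlate rho2; the other n*n - n*q pairs correlate rho1*rho2.
    covariance = q * n * rho2 * (1 - rho1) + n ** 2 * rho1 * rho2
    # k^2 from the unique impression, n from the acts, plus n(n-1) act pairs at rho1.
    variance = k ** 2 + n + n * (n - 1) * rho1
    return covariance / variance


# Perfect overlap and no unique impression: r should equal rho2 for any n.
print(consensus(n=10, q=1.0, rho1=0.1, rho2=0.5, k=0.0))

# q = 0, rho2 = 1, k = 0: r should match the Spearman-Brown prophecy formula.
print(consensus(n=5, q=0.0, rho1=0.3, rho2=1.0, k=0.0), 5 * 0.3 / (1 + 4 * 0.3))
```

The two printed checks correspond to special cases discussed in the text and offer a quick way to confirm that a given implementation of the formula behaves as expected.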

The formula is quite complicated despite many simplifying assumptions. The effect of the communication parameter can be assessed. As shown in Equation 3, the value of r, the consensus correlation, is adjusted by a, the communication parameter (a ≠ 0), and denoted as r'.

$$r' = \frac{r + a^2 r + 2a}{1 + a^2 + 2ar} \tag{3}$$
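As a small illustration, the communication adjustment can be computed directly. The sketch assumes that the two precommunication impressions are standardized and correlate r, and that each judge's final impression equals his or her own precommunication impression plus a times the other judge's final impression. The function name `consensus_with_communication` is hypothetical.

```python
def consensus_with_communication(r, a):
    """Adjust the consensus correlation r for the communication parameter a."""
    return (r + a * a * r + 2 * a) / (1 + a * a + 2 * a * r)


# With a = 0 the adjustment leaves r unchanged.
print(consensus_with_communication(0.5, 0.0))

# Positive communication raises consensus above the unadjusted r of .5.
print(consensus_with_communication(0.5, 0.2))
```

A negative value of a, as when the judges dislike each other, lowers the adjusted correlation below r.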

One can therefore quantitatively assess the simultaneous influence of all six parameters. The formula for r is more familiar than it might first seem. If q is set to 0, ρ2 to 1, and k to 0, the formula is the Spearman-Brown prophecy formula. The parameter ρ1 takes on the role of the interitem correlation and n, the number of items. So the formula for r is a generalization of Spearman-Brown. The two key differences are the unique impression and the similarity of meaning systems parameters.

Limitations of the Model

The model contains several assumptions. First, each act is weighted equally, and the weights are assumed to be the same for each judge. As stated earlier, this assumption represents only a starting point. Second, the number of acts observed is assumed to be the same for both judges. This is not really a serious limitation, but the assumption is made to simplify the equation. Third, the unique impression is assumed to be independent of the scale values and of the other judge's unique impression. Fourth, to scale the unique impression and the scale values, it is assumed that each has a variance of 1. This is not really an assumption, but rather a way of specifying the units of measurement of the variables.

A key issue is the magnitude of the correlation between two scale values for two different acts that are judged by two different judges. The assumption has been made that this correlation equals ρ2 × ρ1. This assumption implies that the partial correlation of judge i's scale value for act j with judge k's scale value for act m, controlling for judge k's scale value for act j, is 0. In words, the assumption is that the nonconsensual part of the


Figure 1. A general model of information integration. (Behavioral acts [A] influence scale values [s], and the scale values and initial impression [s10] combine to determine impression [I].)

scale value (the part of the scale value that does not correlate with the average of all other judges' scale values for that act) correlates across acts to the same degree as the consensual part. Alternatively, an additional parameter ρ3 could have been added to the model, but in the interests of parsimony, I decided to set ρ3 to be the product of two other parameters. It is also assumed that an act affects only its own scale value; thus, it is assumed that there are no change-of-meaning effects. Finally, it is assumed that communication between judges involves the impression (I) and not the acts (A) or the scale values (s).

The meaning of the unique impression depends on the design of the research. If the same pair of judges rates all the targets, then the unique impression refers to the different unique impressions that a single judge has of the different targets. If different judges rate each target, then the unique impression also includes variability in the judges' average unique impression across targets.

Implications of the Equation

Considered now are the effects of the parameters on consensus. Initially, to simplify matters somewhat, the following assumptions are made: First, it is assumed that there are no communication effects. Second, to avoid trivial cases, the following assumptions, unless otherwise stated, are made: n > 1, 0 < ρ2 < 1, and ρ1 < 1. The major conclusions of this section are summarized in Table 1.

As acquaintance or n increases, consensus also increases if one of two conditions holds. First, if there is overlap (q > 0) that stays constant over time and the weight of the initial impression

(which is greater than 0) does not change, then r increases as n increases because the unique impression receives a lower relative weight as acquaintance increases. Second, if there is consistency (0 < ρ1 < 1), greater acquaintance leads to greater consensus. In this case, consensus increases because of the increased reliability in the judges' impressions. That is, with increased acquaintance, a judge samples more of the target's acts and so forms a more reliable impression.

The relationship between acquaintance and consensus is shown graphically in Figure 2. In the top panel, the following assumptions have been made: ρ1 = .1, ρ2 = .5, a = 0, and k = 0. The relationship between n (acquaintance) and r (consensus) is presented for three values of the overlap parameter. When there is perfect overlap (q = 1), r does not increase as n does. Perfect overlap and a zero weight for the unique impression result in consensus having no relationship with acquaintance. As q declines, n begins to have more of an effect on r. The increase in consensus as a function of acquaintance is due to the increased reliability due to the larger sample size. However, after an initial increase, the function is relatively flat.

The difference between the bottom and top panels of Figure 2 is that the bottom has allowed for a unique impression. The value of k is set to 1.0, and the other parameters remain unchanged. In the bottom panel, regardless of the value of q, as n increases, so does r. However, as q increases, r starts out higher and increases at a slower rate as n increases. For perfect overlap, the function is relatively flat once n equals 6.

Figure 2 clearly shows that the strength of the effect of acquaintance or n on consensus depends heavily on the overlap parameter q. If there is high overlap, there is virtually no increase in r as n increases. This is an important result that is not


Table 1
Implications of the General Model

Parameter            Implication
Acquaintance         As n increases, r increases if either kq > 0 or ρ1 > 0 given that k > 0 or q < 1. If k = 0 and q = 1, then r = ρ2 regardless of the value of n.
Overlap              As q increases, r increases.
Meaning system       As ρ2 increases, r increases if q > 0 or ρ1 > 0.
Unique impression    As k increases, r decreases if either q > 0 or ρ1 > 0.
Consistency          As ρ1 increases, r increases if either k > 0 or q < 1.
Communication        As a increases, r increases.

supported by common sense, but as will be seen, it is supported by empirical work.

The unique impression may have a constant effect over time. If the relative weight of the unique impression does not vary as a function of n (i.e., k1/n1 = k2/n2), q = 1, and ρ1 = 0, then r does not increase as n does. That is, the function resembles the top function in the top panel of Figure 2. Again, consensus and acquaintance are unrelated when overlap is perfect.

The maximum limit of r, assuming no communication, is ρ2. Regardless of the number of acts, the limit for r is the correlation between act scale values. Returning to Figure 2, all of the curves asymptote at .5, the value of ρ2. Consensus rapidly approaches that maximum in high-overlap studies. Note that if k = 0 and
