Copyright 1992 by the American Psychological Association, Inc. 0021-9010/92/$3.00

Journal of Applied Psychology 1992, Vol. 77, No. 4, 501-510

Frame-of-Reference Training and Cognitive Categorization: An Empirical Investigation of Rater Memory Issues

Lorne M. Sulsky and David V. Day
Louisiana State University


We considered the effects of frame-of-reference (FOR) training on raters' ability to correctly classify ratee performance as well as their ability to recognize previously observed behaviors. The purpose was to examine the cognitive changes associated with FOR training to better understand why such training generally improves rating accuracy. We trained college students (N = 94) using either FOR or control procedures, had them observe three managers on videotape, and rate the managers on three performance dimensions. Results supported our hypotheses that, compared with control training, FOR training led to better rating accuracy and better classification accuracy. Also consistent with predictions, FOR training resulted in lower decision criteria (i.e., higher bias) and lower behavioral accuracy on a recognition memory task involving impression-consistent behaviors. The implications of these results are discussed, particularly in terms of the ability of FOR-trained raters to provide accurate performance feedback to ratees.

We contributed equally to this project. Order of authorship was determined randomly. We would like to extend our sincere thanks to Bob Lord, Art Bedeian, Dirk Steiner, Neil Hauenstein, and two anonymous reviewers for their insightful comments and criticisms on earlier drafts of this article. David V. Day is now at the Department of Psychology, 643 Bruce V. Moore Building, The Pennsylvania State University, University Park, Pennsylvania 16802-3106. Correspondence concerning this article should be addressed to Lorne M. Sulsky, who is now at the Department of Psychology, University of Calgary, 2500 University Drive NW, Calgary, Alberta T2N 1N4, Canada.

Training raters to improve the psychometric quality of their ratings has been a major focus of performance appraisal research (D. E. Smith, 1986). In particular, it has been suggested that frame-of-reference (FOR) training offers a promising approach for improving the accuracy of performance ratings (e.g., Bernardin & Buckley, 1981; Pulakos, 1984, 1986). Only more recently, however, has research addressed the changes in cognitive processes associated with FOR training (e.g., Athey & McIntyre, 1987; Hauenstein & Foti, 1989). Such efforts are important for understanding how FOR training works and why it is effective. Our purpose was to contribute to this sparse but growing knowledge base by examining the effects of FOR training on raters' ability to correctly classify ratee performance as well as their ability to recognize previously observed ratee behaviors.

Overview of FOR Training

The general philosophy of FOR training is to provide raters with a common reference standard (i.e., "frames") by which to evaluate subordinate performance (Bernardin & Buckley, 1981). FOR training typically involves (a) matching ratee behaviors to their appropriate performance dimensions and (b) correctly judging the effectiveness levels of specific ratee behaviors. More simply, the training is designed to "calibrate" raters so that they agree on what constitutes varying levels of performance effectiveness for each performance dimension. The intended result of this training is to standardize raters' perceptions of performance with the ultimate goal of improving rating accuracy (Athey & McIntyre, 1987). A number of studies have shown that FOR-trained raters produce ratings that are significantly more accurate than those provided by raters receiving other types of (or no) training (Bernardin & Pence, 1980; McIntyre, Smith, & Hassett, 1984; Pulakos, 1984, 1986).

Although previous research has generally supported the efficacy of FOR training for improving rater accuracy, little attention has been given to why FOR training is successful in this regard. Athey and McIntyre (1987), for example, found support for the hypothesis that FOR-trained raters would remember more training content than would raters trained in other procedures. However, their results do not directly address the more fundamental question of which FOR training properties account for the improvements in accuracy. That is, what specifically do raters learn during FOR training that leads to enhanced rating accuracy? One possible answer may be found in the theory of cognitive categorization. A discussion of categorization theory and its relevance to FOR training follows.

Cognitive Categorization and FOR Training

It has been suggested that supervisors categorize employees on the basis of preexisting prototypes (i.e., abstract representations based on a set of characteristic features) of effective and ineffective performance (Feldman, 1981). Furthermore, it has been argued that raters tend to use these abstract representations (i.e., general categorizations) more than specific behavioral information when rating performance from memory (Feldman, 1981; Lord, 1985b). If the time between observation and performance rating is lengthy, information decay is likely and memory for specific behaviors is diminished (Cooper, 1981; Shweder & D'Andrade, 1980). Along these lines, a number of researchers have suggested that a primary goal of rater training should be to counteract the inevitable decay of performance information due to memory loss as the time between observing ratee performance and rating that performance increases (DeNisi, Cafferty, & Meglino, 1984; Feldman, 1981; Ilgen & Feldman, 1983; Mount & Thompson, 1987; Nathan & Alexander, 1985; Phillips, 1984). One means that has been advanced for accomplishing this goal is to provide raters with prototypes for the performance dimensions to be evaluated (Hauenstein & Foti, 1989). Establishing prototypes for raters should maximize the likelihood that ratees will be correctly categorized on each dimension on the basis of their observed performance. Correct categorizations imply the formation of accurate impressions about performance quality on each dimension. Such categorizations would contribute to rating accuracy even if raters did not recall specific performance information to guide their ratings (DeNisi et al., 1984; Feldman, 1981).

Categorization is relevant to FOR training because a goal of such training is to provide raters with appropriate prototypes across performance levels (e.g., high, average, and low) for each rating dimension. Therefore, one possible explanation for the effectiveness of FOR training is that it provides raters with "correct" prototypes by which to categorize ratee performance on each performance dimension. These prototypes are based on a theory of performance, which is determined by the organization or researchers prior to training. A theory of performance makes explicit the behaviors that constitute varying performance-effectiveness levels on each dimension. If the same theory of performance used in FOR training is also used to generate true score or comparison score ratings for computing rating accuracy (Sulsky & Balzer, 1988), it logically follows that overall rating accuracy should improve when accuracy is defined as the tendency for raters to produce dimensional ratings that are close to the true score ratings. However, memory accuracy for specific ratee behaviors (e.g., whether the ratee actually did a particular thing) may not improve or, perhaps, could even decline. The distinction between accuracy in classifying or categorizing ratees according to performance levels and accuracy in remembering ratee behavior is discussed in the following section.

Classification Versus Behavioral Accuracy

Lord (1985a) distinguished between classification and behavioral accuracy as a means of better examining raters' cognitive processes. Classification accuracy (CA) refers to a rater's ability to correctly categorize ratees according to performance levels. Lord (1985a), as well as Padgett and Ilgen (1989), operationalized CA by considering two critical factors: (a) the recognition of actually occurring behaviors (hits) that would be expected on the basis of the ratee's performance level, and (b) the recognition of impression-consistent foils (false alarms) that would also be expected on the basis of the ratee's performance level. An implication of CA is that if raters are asked to remember specific ratee behaviors, they may recognize behaviors that are consistent with their previously formed impressions or classifications, whether the behavior actually occurred or not (Feldman, 1981; Lord, 1985b).

Lord (1985a) originally conceptualized CA with regard to forming an overall impression or classification of the target ratee's performance. The nature of FOR training, however, requires that classification be considered on a dimensional basis because raters are trained to form separate impressions on each performance dimension. Subsequent recognition of a behavior will therefore depend on its perceived performance level and the impression formed on the relevant dimension. Lord argued that rater-training programs that encourage the use of categories that simplify information processing—such as FOR training—may enhance CA even though behavioral accuracy may actually decline. We should point out, however, that our use of CA on a dimensional basis must be considered a variant of Lord's conceptualization of the index, because he was primarily concerned with overall impressions.

Behavioral accuracy (BA) is "based on raters' veridical encoding and recall of specific behaviors" (Lord, 1985a, p. 67). It is considered to be more precise than CA, which involves the formation of impressions about the ratee. If a rater is asked about a specific ratee behavior, those behaviors consistent with the previously formed dimensional impression might be recognized regardless of whether they occurred. Thus, FOR training could result in a decrease in BA when a rater is asked to recognize foil behaviors (i.e., behaviors that did not occur) that would be expected on the basis of the ratee's true performance level. Because of the typical time lag between performance observation and ratings, only global, summary evaluations of the performance dimensions are expected to remain in long-term memory. Research suggests that subsequent recall and recognition of ratee information is biased toward prototypical features of a category (Lord, 1985b), even to the extent of endorsing behaviors that did not actually occur (Cantor & Mischel, 1977; Nathan & Lord, 1983; Padgett & Ilgen, 1989; Phillips & Lord, 1982; Woll & Graesser, 1982). These findings suggest a reduction in behavioral accuracy as a function of the strength of the dimensional impression.

Bias

Lord (1985a) identified bias, or the decision threshold adopted by raters in deciding whether a behavior occurred, as another criterion of interest in studying accuracy. Our distinction between BA and bias stems from threshold models of memory and is important for a full understanding of the cognitive processing that underlies performance on recognition memory tasks. Threshold models define discrete memory states (e.g., observed or did not observe the behavior) instead of assuming a continuum of memory strengths as signal detection models do (Snodgrass & Corwin, 1988). According to threshold models, bias is defined as the probability of saying yes to an item when faced with a recognition task under conditions of uncertainty. In the present context, it is an index of the decision criteria adopted by raters in responding to a recognition memory task. A rater who exhibits a high level of bias is likely to claim recognition of many behaviors and should produce both high hit rates and high false-alarm rates.

Because FOR-trained raters are likely to rely on their impressions about ratee performance when asked to recognize ratee behaviors, their decision criteria should differ when they are faced with impression-consistent or impression-inconsistent behaviors. Specifically, because of the general finding that subjects are influenced by a priori categorizations and impressions during recognition tasks (Lord, 1985b), we expected FOR-trained raters to have more liberal decision criteria and exhibit greater bias when faced with impression-consistent behaviors. As defined in the present study, impression-consistent behaviors are those that are consistent with a ratee's true performance level. Raters should be more likely to recognize these behaviors if they form the correct impression on the relevant dimension; the latter is an expected outcome of FOR training.

Present Study

The central thesis of the present study is that FOR training provides raters with job-relevant prototypes for each performance dimension on the basis of a theory of performance. Such prototypes (or frames) are believed to help raters correctly classify employees on important performance dimensions, which enhances both rating accuracy and CA. However, relying on categorization could reduce BA if raters are asked to recognize foil behaviors that are consistent with the ratee's true performance on the dimension of interest. The decision criteria adopted by FOR-trained raters (i.e., bias) should also be a function of whether a given behavior is impression consistent. Although raters not receiving FOR training may also rely on prototypes and categorization processes, the use of incorrect prototypes is more likely. This would result in reductions in both rating and classification accuracy. Bias, however, should not be greatly affected.

In general, we expected FOR training to promote the formation of correct impressions on individual performance dimensions. Therefore, compared with control subjects receiving minimal training, FOR-trained subjects were expected to produce more accurate performance ratings in general (Hypothesis 1); demonstrate higher CA (Hypothesis 2); use a more biased decision criterion for recognizing impression-consistent behaviors (Hypothesis 3); and demonstrate lower overall BA for a particular ratee (Hypothesis 4). In addition, we were interested in examining the issue of bias when raters are asked to decide whether any ratee exhibited a particular behavior. This type of recognition task affords the opportunity to gain further insights into the information processing of raters (i.e., whether FOR training promotes the use of categorization in encoding and recognizing ratee behavior). We expected bias to be higher for FOR-trained subjects than for control subjects because FOR-trained raters were expected to consider more behaviors to be plausible if such behaviors were consistent with raters' dimensional impressions for one or more of the ratees. We thought it unlikely, however, that control subjects would form consistently similar impressions, if they formed impressions at all. Thus, we hypothesized that, compared with controls, FOR-trained subjects would demonstrate higher bias when asked whether any of the ratees exhibited a given behavior (Hypothesis 5).

Method

Subjects

Ninety-four undergraduates enrolled in psychology classes were randomly assigned to either an FOR condition (n = 47) or a control condition (n = 47). Because data were missing for one control subject, the total sample size was reduced to 93. The mean age of the sample was 22 years, and 67% were female. Subjects received course credit for their participation.

Procedure

Subjects were told at the outset that the purpose of the study was to examine how people evaluate other individuals in work situations. We asked subjects to assume the role of a general manager who is responsible for appraising the performance of some employees. After training, subjects viewed videotapes of three actors, each playing the role of a manager, and evaluated the performance of these actors on three performance dimensions, using a behaviorally anchored rating scale (BARS; P. C. Smith & Kendall, 1963). Next, we gave subjects written descriptions of behaviors that either occurred or did not occur (i.e., foils) on the videotapes (across ratees) and asked the subjects to indicate which behaviors they had actually observed. Finally, subjects were given a similar recognition task, except they were asked to indicate which behaviors they had observed for a specific ratee.

Stimulus Materials

Stimulus materials were adapted from critical incidents of managerial performance developed by Roberson and Banks (1986). The incidents were based on videotapes of eight fictitious managers interviewing problem subordinates (Borman, 1977). The critical incidents related to four separate performance dimensions: (a) motivating employees, (b) developing employees, (c) establishing and maintaining rapport, and (d) resolving conflicts. In Roberson and Banks's stimulus development work, incidents were originally identified by two graduate students, who watched each tape and translated every incident into a performance dimension. Incidents that were translated into the same dimension by both students were given to three additional graduate students, who each retranslated the incidents into dimensions and also rated the effectiveness of each incident. Incidents retranslated into the same dimension by all three students and having the lowest standard deviation for effectiveness ratings were retained and used to develop written managerial performance vignettes. For the present study, 40 of the incidents (10 per dimension) were videotaped with graduate student actors. In total, 2 incidents per dimension were videotaped for each of five managers (i.e., ratees). Two of the managers were used as practice ratees for the FOR-training condition, and three were used as target stimuli for the performance evaluation task.
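Roberson and Banks's retention rule combined a unanimity criterion with a variability criterion. The following is a rough sketch of that selection logic, not their actual procedure: the function and field names are ours, and treating "lowest standard deviation" as keeping the least-variable incidents per dimension is our interpretation.

```python
from statistics import stdev

def select_incidents(candidates, per_dimension):
    """candidates: dicts with 'dimension' (the dimension all three
    retranslating students agreed on, or None if they disagreed) and
    'ratings' (the three effectiveness ratings).
    Keeps unanimously classified incidents with the least rating spread."""
    unanimous = {}
    for inc in candidates:
        if inc["dimension"] is not None:  # all three students agreed
            unanimous.setdefault(inc["dimension"], []).append(inc)
    return {
        dim: sorted(incs, key=lambda i: stdev(i["ratings"]))[:per_dimension]
        for dim, incs in unanimous.items()
    }
```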

Rating Scales and Comparison Scores

Three 7-point BARS developed by Borman (1978) were used to assess ratee performance on three of the four dimensions described in the previous section. The resolving conflicts dimension was not used for evaluation purposes; these incidents served as distractors and were considered to be irrelevant to performance. The resolving conflicts incidents were included to make the appraisal task more realistic, given that raters often observe ratee behavior that is irrelevant to performance evaluation. The details of scale development were outlined by Borman (1978).

Because the computation of rating accuracy requires the availability of true or comparison scores (Sulsky & Balzer, 1988), three true scores (one from each performance dimension) for each of the five ratees were developed for the BARS. Building on procedures developed by Borman (1978) and recommended by Sulsky and Balzer, these scores were estimated separately by two judges (one of the trainers used in the study and a graduate student). First, the judges discussed the rating dimensions and reached consensus about what constituted effective and ineffective performance. Thus, a theory of performance was generated. Also, consensus was reached on how information should be combined when forming a rating on a specific performance dimension. Next, both judges provided ratings on all performance dimensions after examining each of the videotaped incidents for a specific manager. The judges then met to discuss rating differences, with the goal of arriving at a set of mutually agreeable true scores that were based on the performance theory. Interrater agreement calculated before the consensus meeting was equal to .88 (intraclass correlation), suggesting satisfactory initial agreement between the judges' performance ratings. The true scores for each of the three ratees used in the evaluation phase of the study were, for motivating employees, developing employees, and establishing and maintaining rapport, respectively, 1, 2, and 5 for Ratee A; 7, 6, and 1 for Ratee B; and 5, 7, and 5 for Ratee C.

Recognition Measures

Overall recognition. All of the behavioral incidents exhibited by the five ratees on the three focal performance dimensions (30 total incidents) were transcribed and listed along with an equal number of devised incidents, which were not on the videotapes. The list of 60 incidents was given to 56 undergraduates, who rated how well each incident fit their image of an effective manager in exchange for course credit. A similar procedure was used by Lord, Foti, and De Vader (1984) to assess the prototypicality of traits and behaviors relevant to leadership perceptions. Because only clear-cut examples of either good or poor performance were desired, incidents that were indicative of good performance (rating of 4 or higher on a 5-point scale) or low performance (rating of 2 or less) were candidates for selection. Incidents that occurred were yoked with foils such that pairs of these incidents with equivalent effectiveness levels were selected for the recognition measure. The motivating employees and establishing and maintaining rapport dimensions each had two high-performance incidents that actually occurred, two high-performance foils, two low-performance incidents that actually occurred, and two low-performance foils. For the developing employees dimension, however, only one actual incident had a low performance rating. We therefore decided to include only one low-performance foil. In total, 22 incidents (11 being foils) were chosen for the recognition measure. For each of the randomly presented incidents, subjects were asked to indicate in writing (yes or no) whether the incident was observed on the videotapes. Subjects did not have to indicate which ratee (if any) was involved in the incident. As such, this measure assessed raters' recognition across ratees. Examples of videotaped incidents along with their mean performance ratings are contained in Table 1.

Individual recognition. A 12-item recognition measure (six incidents that occurred, six foils) was developed for Ratee B. The six occurring behaviors (two per performance dimension) were all impression consistent (i.e., each incident was consistent with the overall true score of the ratee on the relevant dimension). For example, both incidents on the establishing and maintaining rapport dimension were indicative of poor performance because Ratee B's true score was low for the dimension. The impression formed by FOR-trained raters about Ratee B's rapport was expected to be low. Impression-inconsistent behaviors that actually occurred could not be included because all of the incidents for Ratee B were consistent (i.e., they were either all positive or all negative). We also included two nonoccurring foils per dimension. For each dimension, one foil was impression consistent, and one was impression inconsistent. For example, Ratee B actually performed poorly on the establishing and maintaining rapport dimension. We included one foil that was indicative of poor rapport (impression consistent) and another foil that was indicative of good rapport (impression inconsistent). Subjects indicated in writing (yes or no) whether they recognized each of the 12 incidents for the individual ratee.
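The composition of the 12-item individual measure can be summarized in a short sketch. This is a minimal illustration only; the dictionary fields and labels are ours, not the authors', and the actual incident wordings are those in the stimulus materials.

```python
# Individual recognition measure for Ratee B: per focal dimension,
# two occurring impression-consistent behaviors, one impression-
# consistent foil, and one impression-inconsistent foil.
DIMENSIONS = ["motivating employees", "developing employees",
              "establishing and maintaining rapport"]

items = []
for dim in DIMENSIONS:
    items += [dict(dimension=dim, occurred=True, consistent=True)
              for _ in range(2)]
    items.append(dict(dimension=dim, occurred=False, consistent=True))
    items.append(dict(dimension=dim, occurred=False, consistent=False))

assert len(items) == 12                         # 12-item measure
assert sum(i["occurred"] for i in items) == 6   # six occurring targets
assert sum(not i["occurred"] for i in items) == 6  # six foils
```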

Rater Training

Rater training was conducted in groups ranging from 2 to 6 participants. Two trainers were used, and each conducted approximately one half of the training sessions.

Frame-of-reference training. FOR training procedures followed those adopted by Pulakos (1984, 1986). First, participants were told that they were to evaluate the performance of ratees on separate dimensions of performance. Next, they were given the BARS, and the trainer read through the dimension definitions and scale anchors aloud. The trainer then discussed with the participants which ratee behaviors were indicative of the different performance levels represented on each of the scales. For example, behaviors representative of a 7 on the motivating employees dimension were distinguished from behaviors representative of a 4 or 1 on the same dimension. The goal was to create a common frame of reference among the raters by providing them with a theory of performance such that they would agree on the particular performance dimension as well as the effectiveness level for each behavior. This procedure was thought to establish managerial performance prototypes for distinct performance levels on each of the rating dimensions. Participants were then presented with the incidents for the first of two practice ratees. Participants practiced evaluating ratee performance using the BARS and indicated their ratings to the rest of the group. The group discussed what behaviors were used to decide on ratings for each performance dimension, paying special attention to any discrepancies among participant ratings. The trainer provided feedback to the participants on their ratings and explained why the ratee should receive a particular rating on each dimension. Once this process was completed for the first ratee, it was repeated for the second practice ratee before subjects proceeded to the target ratees.

Control training. Subjects in the control condition were told they would be evaluating ratee performance on a number of different performance dimensions at a later time. They were introduced to the BARS as the trainer read through the dimension definitions and scale anchors; however, no specific training took place in this condition. Instead, a lecture on performance appraisal research was given. The FOR and control training sessions each lasted approximately 1 hr and 15 min.

Dependent Measures

Rating accuracy. One type of rating accuracy was assessed with the overall distance measure reported by McIntyre et al. (1984). This measure provides an index of how close participant ratings are to the true scores generated by the judges. Rating accuracy was also examined with Borman's (1977) differential accuracy (DA) measure. This measure examines the correlation between ratings for each performance dimension and the corresponding true scores for the same dimension across ratees. The average of the correlations across dimensions is taken as an index of DA for each rater.

Classification accuracy. This accuracy index was computed for each subject according to the formula provided by Padgett and Ilgen (1989):

CA = (impression-consistent hit rate + impression-consistent false-alarm rate) - (impression-inconsistent hit rate + impression-inconsistent false-alarm rate).
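To make these accuracy indexes concrete, the following sketch computes them for a hypothetical rater, using the true scores reported under Rating Scales and Comparison Scores. The exact form of McIntyre et al.'s distance measure is not reproduced in the article, so the mean absolute difference used here is our assumption, as is the optional Fisher z transformation of the DA correlations; all function names and the example ratings are ours.

```python
import numpy as np

# True scores from the Rating Scales and Comparison Scores section
# (rows = Ratees A, B, C; columns = motivating, developing, rapport).
TRUE = np.array([
    [1, 2, 5],   # Ratee A
    [7, 6, 1],   # Ratee B
    [5, 7, 5],   # Ratee C
])

def distance_accuracy(ratings, true=TRUE):
    """Overall distance between a rater's ratings and the true scores.
    Mean absolute difference is one plausible form (lower = more
    accurate, matching the note to Table 2); McIntyre et al.'s (1984)
    exact formula may differ."""
    return np.abs(ratings - true).mean()

def differential_accuracy(ratings, true=TRUE, fisher_z=True):
    """Borman's (1977) DA: correlate ratings with true scores within
    each dimension (across ratees), then average. Whether the
    correlations are Fisher z-transformed first is our assumption."""
    rs = []
    for d in range(true.shape[1]):
        r = np.corrcoef(ratings[:, d], true[:, d])[0, 1]
        rs.append(np.arctanh(r) if fisher_z else r)
    return float(np.mean(rs))

def classification_accuracy(hit_c, fa_c, hit_i, fa_i):
    """Padgett and Ilgen's (1989) CA index, exactly as in the text.
    _c = impression-consistent rates, _i = impression-inconsistent
    rates (hit_i was a constant zero in this study)."""
    return (hit_c + fa_c) - (hit_i + fa_i)

# Hypothetical rater, for illustration only:
ratings = np.array([[2, 2, 4], [6, 6, 2], [5, 6, 5]])
print(distance_accuracy(ratings))      # ~0.56
print(differential_accuracy(ratings))  # averaged Fisher-z correlations (~2.4)
```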

Table 1
Examples of Videotaped Behavioral Incidents

Dimension/example (mean performance rating)

Establishing and maintaining rapport
  The manager begins the interview by greeting the employee in a warm friendly manner. He offers the employee a chair and tells him to get comfortable. (4.80)
  When the employee becomes angry over not being informed of a cutback in overtime in his department, the manager tells him that he is overreacting. (1.78)
Motivating employees
  The manager tells the employee that he is very valuable to the company and that he is the kind of person the company needs to keep. (4.78)
  The manager is very blunt in pointing out the employee's previous poor performance in certain areas and ignores the employee's contention that his department is one of the most productive in the company. (1.29)
Developing employees
  The manager tells the employee about the types of skills required for management and offers to discuss how he can further develop necessary skills to be an effective manager. (4.81)
  The manager demonstrates little interest in the employee's professional development and tells him that he will have to work on his own to accomplish changes in his supervisory style. (1.38)

Note. Higher ratings are associated with better performance.

However, because there were no impression-inconsistent hits for any of the ratees, a constant value of zero was used for this portion of the formula across raters in both training conditions.

Behavioral accuracy. For each subject, BA was assessed separately for the overall recognition measure and for the individual ratee recognition measure. We employed a BA index recommended by Snodgrass and Corwin (1988) for recognition data, which required subtracting the false-alarm rate from the hit rate for each rater:

BA = (impression-consistent hit rate + impression-inconsistent hit rate) - (impression-consistent false-alarm rate + impression-inconsistent false-alarm rate).

Once again, impression-inconsistent hits did not exist, and thus a constant zero was used in the computation.

Bias. Rater bias was computed with a measure recommended by Snodgrass and Corwin (1988):

Bias = false-alarm rate / [1 - (hit rate - false-alarm rate)].

Higher scores on this index are associated with more lenient decision criteria. To examine bias for the impression-consistent behaviors and impression-inconsistent behaviors separately for the single ratee, we simply inserted the relevant hit and false-alarm rates into the formula. However, because bias is undefined for hit rates of 1.0 and corresponding false-alarm rates of 0, estimates were corrected by adding 0.5 to each frequency and dividing by N + 1, where N is the number of relevant items. Snodgrass and Corwin (1988) recommend the routine use of this correction in the analysis of recognition memory data.
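As an illustration of how these recognition indexes behave, here is a minimal sketch of the corrected rates, BA, and bias computations. The function names are ours and the response counts are hypothetical; the formulas follow Snodgrass and Corwin (1988) as quoted above.

```python
def corrected_rate(yes_count, n_items):
    # Snodgrass and Corwin's (1988) correction: add 0.5 to each
    # frequency and divide by N + 1, so rates of exactly 0 or 1
    # (which leave bias undefined) cannot occur.
    return (yes_count + 0.5) / (n_items + 1)

def behavioral_accuracy(hit_rate, fa_rate):
    # BA = hit rate - false-alarm rate (impression-inconsistent hits
    # were a constant zero in this study).
    return hit_rate - fa_rate

def bias(hit_rate, fa_rate):
    # Bias = FA / [1 - (hit rate - FA)]; higher = more lenient criterion.
    return fa_rate / (1.0 - (hit_rate - fa_rate))

# Hypothetical rater on the 12-item individual measure (6 targets,
# 6 foils): says yes to 5 of 6 occurring behaviors and 2 of 6 foils.
hr = corrected_rate(5, 6)   # 0.786
fa = corrected_rate(2, 6)   # 0.357
print(behavioral_accuracy(hr, fa))  # 0.429
print(bias(hr, fa))                 # 0.357 / (1 - 0.429) = 0.625
```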

Results

Hypothesis 1 predicted that FOR-trained subjects would produce ratings higher in overall accuracy than those of the control subjects. The groups were compared (t tests) on two traditional accuracy indexes. The results (see Table 2) were significant for both distance accuracy, t(91) = -5.57, p < .01, and differential accuracy, t(91) = 7.13, p < .01, suggesting that FOR-trained subjects produced ratings that were significantly more accurate than those of the control subjects. Hypothesis 2, predicting higher classification accuracy of FOR-trained subjects, was also supported, t(91) = 2.61, p < .01. The results for Hypotheses 2-4 are summarized in Table 3.

To test Hypothesis 3, predicting higher bias for impression-consistent behaviors, we computed t tests separately for impression-consistent and impression-inconsistent behaviors. The results supported the hypothesis that FOR-trained subjects would show significantly greater bias for impression-consistent behaviors than would controls, t(91) = 3.31, p < .01; however, no significant group difference emerged for impression-inconsistent behaviors.

Table 2
Means, Standard Deviations, and Accuracy Test Results Across Ratees

                            Group
Dependent variable       FOR        Control       t
Distance accuracy
  M                      1.26        1.77       -5.57*
  SD                     0.48        0.41
Differential accuracy
  M                      1.58        0.59        7.13*
  SD                     0.84        0.44

Note. Low values for distance accuracy and high values for differential accuracy denote greater accuracy. FOR = frame of reference.
* p < .01.
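The t values in Table 2 can be approximately recovered from the reported means and standard deviations with a pooled-variance independent-samples t test (df = 47 + 46 - 2 = 91, the group sizes from the Method section). A sketch follows; this is our code for checking the arithmetic, not the authors' analysis script, and the cell values reflect the table as reconstructed above.

```python
from math import sqrt

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    # Pooled-variance independent-samples t test (df = n1 + n2 - 2).
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Distance accuracy (lower = more accurate): FOR vs. control.
print(pooled_t(1.26, 0.48, 47, 1.77, 0.41, 46))  # ~ -5.5, cf. -5.57
# Differential accuracy (higher = more accurate).
print(pooled_t(1.58, 0.84, 47, 0.59, 0.44, 46))  # ~ 7.1, cf. 7.13
```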
