Eye Movements and Scene Perception

KEITH RAYNER and ALEXANDER POLLATSEK
University of Massachusetts

Abstract Research on eye movements and scene perception is reviewed. Following an initial discussion of some basic facts about eye movements and perception, the following topics are discussed: (1) the span of effective vision during scene perception, (2) the role of eye movements in scene perception, (3) integration of information across saccades, (4) scene context, object identification, and eye movements, and (5) the control of eye movements. The relationship of eye movements during reading to eye movements during scene perception is considered. A preliminary model of eye movement control in scene perception is described and directions for future research are suggested.

During normal cognitive processing tasks such as scene perception, reading, and visual search, we typically move our eyes on the order of three to four times per second (Rayner, 1978, 1984). An important reason for these saccadic eye movements is acuity limitations of the visual system. The ability to process fine detail is pretty much limited to the two degrees of central vision making up the foveal region of the retina; even within the fovea there is a decrease in visual acuity from the point of fixation out to the edge of the fovea.


Outside of the fovea, acuity functions show an even more pronounced decline with increasing eccentricity. The region beyond the fovea is typically divided into parafoveal vision (which extends out to 5 degrees from the fixation point) and peripheral vision (which begins at approximately 5 degrees from fixation and consists of the rest of the visual field). Thus, if we need to process detailed information from the visual array, we need to point our eyes in the direction of that information.

As we survey a panoramic scene consisting of various types of stimuli (which we will refer to as objects for ease of exposition), it is far from clear how we extract information from the array. While we are often under the impression that we can see the whole scene in a single glance, if we maintain fixation on a single point, it is clear that we cannot perceive a lot of detail in the parafovea and periphery. However, we have the feeling that, even on a single fixation, we can process the "gist" of the entire scene. Some of the research we review supports this intuition. What is less clear is how much movement of the eyes is strictly necessary for processing needed information in a scene and how much just results from habit or a desire to relieve boredom. (There are also head movements made when attention moves larger distances, Sanders, 1970; however, we will restrict our discussion to eye movements.)

Part of the problem in answering the question of the function of eye movements in viewing scenes is that there is no well-defined task of "scene perception". We are either looking at a scene searching for a particular object or person, processing it as a set of geometric objects to be avoided as we locomote through space, extracting aesthetic pleasure, or not really engaged in any well-defined task as we view the world around us. One possibility is that the particular task does not matter, and processing a large amount of information from a scene is unavoidable, regardless of what we intend to do with the scene information. Some of the research we review suggests that this is not correct, however, and that what we process is, at least to some extent, task dependent.

We thus offer no magic solution to the question of what is meant exactly by scene perception. However, what we take as a likely substrate of most such tasks is an identification of the gist or setting of a scene (such as whether you are in a room or in the middle of the woods) together with an identification of all or most of the important objects in the scene. We will beg the question of what it means for an object to be important and assume that viewers would usually attempt to identify most objects that subtended a degree or two in the visual field. Another question that we will avoid is what level of identification is usually attempted. That would clearly depend on the type of object and the situation. When looking at a farm scene, most city dwellers would probably be content to identify something as a tractor, whereas a farmer would likely attempt a more precise identification.


Similarly, for some purposes, people might be content to identify a human in a scene as a man or woman, whereas for other purposes they might want to identify who the person is and/or whether it is someone that they know.

To summarize, eye movements are a normal aspect of everyday visual perception. They allow detailed information to be processed that otherwise would not be; however, their exact function in the information processing chain of "scene perception" is not clear. In this article, we will discuss several aspects of eye movements during scene perception. We will begin by discussing some basic facts about eye movements in scene perception. We will then discuss five topics which we deem essential to understanding eye movements and scene perception: (1) the span of the effective stimulus; (2) the importance of eye movements in scene perception; (3) the integration of information across eye movements; (4) the relationship between scene context, object identification and eye movements; and (5) the control of eye movements.

SOME BASIC FACTS ABOUT EYE MOVEMENTS

There are four basic types of eye movements: pursuit, vergence, vestibular, and saccadic eye movements. Our emphasis in this article is on saccadic eye movements because they are the most relevant for understanding scene perception. Actually, vergence eye movements are relevant for scene perception when the stimulus consists of a three-dimensional array. Vergence eye movements occur when we move our eyes from a distant object to a near object (or vice versa) along the same line of sight. Since most research on eye movements and scene perception has utilized two-dimensional arrays, there are few data on vergence eye movements in this context. Pursuit eye movements occur when an object moves across the visual field at a relatively steady velocity. Since pursuit movements typically require a moving stimulus to elicit the eye movement, they are not particularly relevant to the perception of static scene displays. Vestibular eye movements occur when the eyes rotate to compensate for head and body movements in order to maintain the same direction of vision. Like vergence and pursuit movements, vestibular movements could occur when looking at a naturalistic three-dimensional scene, but since most research on scene perception has involved two-dimensional arrays, they have not been studied in this context. In contrast to vergence, pursuit, and vestibular movements, saccadic eye movements have been examined in the context of the perception of two-dimensional scenes, and they are the focus of the remainder of this article.

Before moving to a more extended discussion of saccadic eye movements, we also need to mention three types of small movements of the eyes: nystagmus, drifts, and microsaccades.


Although researchers interested in eye movements and scene perception typically talk about fixations as the period of time when the eyes are still and new information is abstracted from the scene, in a way the term fixation is a misnomer. This is because the eyes are never really still: Even during a fixation there is a constant tremor of the eye which is referred to as nystagmus. These tremors of the eye are quite small and fast, occurring several hundred times a second. Their exact nature is somewhat unclear, though it is often assumed that the movements are related to perceptual activity and help the nerve cells in the retina to keep firing. Drifts and microsaccades tend to be somewhat larger movements than the nystagmus movements. While the reasons for these movements are not completely clear, it appears that the eye occasionally drifts (i.e., makes a small and rather slow movement) because of less than perfect control of the oculomotor system by the nervous system. When this happens, there is often a small microsaccade (i.e., a much more rapid small movement) to bring the eye back to where it was.

Most experimenters assume that such small movements are "noise" and adopt scoring procedures in which these small movements are ignored. For example, some scoring procedures take successive fixations that are separated by less than half a degree and lump them together as a single fixation. Another alternative is a more sophisticated pooling procedure in which fixations are pooled if the intervening saccade is less than half a degree and at least one of the fixations is short (100 ms or less); a simple sketch of such a procedure is given below. Most eye movement data in scene perception have been adjusted using some sort of procedure that pools some fixations and ignores at least some small drifts and microsaccades. In some cases, the eye movement recording system is not sensitive enough to detect these small movements, so that such movements are automatically ignored. Other researchers, with more sensitive equipment, decide on some sort of criterion for pooling. Since drifts and microsaccades are relatively uninteresting aspects of the eye movement record, and since there is enough complexity in the data without worrying about them, our subsequent discussion will ignore them for the most part.

Saccadic eye movements are typically regarded as ballistic movements, which means that once they have begun, we cannot alter their trajectory. Like any other type of motor movement, saccadic eye movements involve some complex control processes. Experiments measuring the reaction time of the eyes, in which subjects must move their eyes to simple stimuli, have demonstrated that the minimum latency of an eye movement is on the order of 150-175 ms (Rayner, Slowiaczek, Clifton, & Bertera, 1983; Salthouse & Ellis, 1980) even under circumstances in which spatial and temporal uncertainty is reduced to a minimum (but see Fischer, 1992, for a review of special circumstances under which very fast "express" saccades may occur). Under some conditions (such as degradation of the target stimulus), the latency of the saccade can be as much as 250-300 ms. Basically, this means that for a fair amount of the time we are looking at one part of the scene, programs are set in motion to move our eyes elsewhere. However, these activities (processing the fixated object and programming the next eye movement) most likely occur in parallel.
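As promised above, here is a minimal sketch of the more sophisticated pooling criterion just described: two successive fixations are merged when the intervening saccade is smaller than half a degree and at least one of the two fixations is short (100 ms or less). The data format, function names, and the rule for choosing the merged position are our own illustrative assumptions, not any published laboratory's scoring code.

    # Sketch of a fixation-pooling procedure (illustrative data format):
    # merge two successive fixations when the intervening saccade is under
    # 0.5 deg and at least one of the two fixations lasts 100 ms or less.

    def distance(a, b):
        """Euclidean distance between two (x, y) positions in degrees."""
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def pool_fixations(fixations, max_saccade_deg=0.5, short_ms=100):
        """Each fixation is a dict with 'x', 'y' (degrees) and 'dur' (ms)."""
        pooled = []
        for fix in fixations:
            if pooled:
                prev = pooled[-1]
                tiny_saccade = distance((prev['x'], prev['y']),
                                        (fix['x'], fix['y'])) < max_saccade_deg
                has_short = prev['dur'] <= short_ms or fix['dur'] <= short_ms
                if tiny_saccade and has_short:
                    # Merge into the previous fixation: durations are summed;
                    # the duration-weighted mean position is one reasonable
                    # (assumed) choice for the pooled location.
                    total = prev['dur'] + fix['dur']
                    prev['x'] = (prev['x'] * prev['dur'] + fix['x'] * fix['dur']) / total
                    prev['y'] = (prev['y'] * prev['dur'] + fix['y'] * fix['dur']) / total
                    prev['dur'] = total
                    continue
            pooled.append(dict(fix))
        return pooled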


TABLE 1
Average fixation duration and saccade length during reading, scene perception,
and visual search; values are estimated from a variety of sources (adapted
from Rayner, 1984)

                          Average Fixation       Average Saccade
    Task                  Duration (ms)          Length
    Reading               225                    2 deg
    Scene perception      330                    4 deg
    Visual search         275                    3 deg

Fixation duration and saccade size

Table 1 presents some comparative data for average measures of eye movement behaviour in scene perception, reading, and visual search. As seen in Table 1, the period of time that the eyes are relatively still in a fixation is typically somewhat longer when people look at a scene than when they read. Also, the average saccade size is larger when looking at a scene than when reading. However, it is also important to note that the average values for fixation duration and saccade size in scene perception can be influenced by the characteristics of the scene and by the task that the subject is asked to do. For example, if the scene is densely packed with information, the average fixation duration will tend to increase and the average saccade size will tend to decrease in comparison to a scene that is not so densely packed with information. Likewise, if the task is to memorize all the objects in a scene, the average fixation will be longer and the average saccade size smaller in comparison to when the task is simply to look at the scene to obtain the gist. If the same scene is shown to two groups of subjects, and one group is asked to search for a target object while the other is told to look at the scene in anticipation of a scene recognition memory test, the former group will most likely show longer fixations and shorter saccades (unless the target object is very obvious in the scene).

In addition to the fact that average fixation duration and saccade size can be influenced by stimulus characteristics and task demands, it is also important to note that there are considerable individual differences in both measures. Some subjects' average fixation durations when looking at a scene may be on the order of 200-225 ms, while others may be on the order of 400-425 ms. And there is also considerable variability in saccade size, with ranges between 2-6 degrees. Finally, there is also large within-subject variability on both measures: A given subject may easily have fixations ranging between 100 and 600 ms and saccades ranging between less than a degree and over 6 degrees.


When our eyes move from one region of the scene to another, they move at extremely rapid rates. The velocity of a saccadic eye movement during scene perception is in the range of 200-500 degrees per second and is a monotonic function of saccade size. The velocity rapidly rises during a saccade to a maximum that occurs slightly before the midpoint of the movement, then drops at a slightly slower rate until the eye reaches its target location. A 2 degree saccade takes about 30 ms and a 5 degree saccade takes about 40 ms to execute.
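To give a rough quantitative handle on these figures, the sketch below fits a straight line through the two example values just cited (2 degrees, about 30 ms; 5 degrees, about 40 ms). The resulting rule of thumb, roughly 23 ms plus 3.3 ms per degree of amplitude, is only a first-order interpolation of those two points, not an established oculomotor model.

    # Linear interpolation of saccade duration from the two example values
    # in the text: a 2-degree saccade takes about 30 ms, a 5-degree one
    # about 40 ms. A rough rule of thumb, not a fitted model.

    SLOPE_MS_PER_DEG = (40.0 - 30.0) / (5.0 - 2.0)  # about 3.33 ms per degree
    INTERCEPT_MS = 30.0 - SLOPE_MS_PER_DEG * 2.0    # about 23.3 ms

    def saccade_duration_ms(amplitude_deg):
        """Estimated duration (ms) of a saccade of a given amplitude (deg)."""
        return INTERCEPT_MS + SLOPE_MS_PER_DEG * amplitude_deg

    print(round(saccade_duration_ms(4.0)))  # a 4-degree saccade: about 37 ms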


Saccadic suppression

During the time that our eyes move, little useful information is abstracted from the scene we are looking at. In fact, visual input is disrupted for a brief time prior to and after the onset of the saccade. This phenomenon, known as saccadic suppression, has been investigated for over a hundred years. Why don't we see anything during the saccade? The answer seems to be that the eyes are moving so fast during a saccade that the image painted on the retina by a fixed stimulus would be largely a smear and thus highly unintelligible. However, we are not aware of any smear. Thus, there must be some mechanism suppressing the largely useless information that is "painted" on the retina during the saccade. One possible mechanism is central anesthesia, in which the brain sends out a signal to the visual system that an eye movement is about to be made and that all input from the eyes should be ignored until after the saccade is over. There is evidence (see Matin, 1974) that the thresholds for stimuli shown during a saccade (and a bit before it begins and after it ends) are raised. The threshold raising before and after the saccade is probably not of much significance for scene perception under most circumstances, since most of the scenes we look at are above threshold. Thus, it is not clear whether these relatively small threshold effects would mean that the ability to extract information from the scene would be altered significantly; it might be like the difference between looking at a picture with a 60 watt bulb and with a 150 watt bulb. However, the threshold effects are more likely to be significant with the moving eye, where the contrast between the light and dark parts of the smear would be far less.

For many years, central anesthesia was accepted as the main mechanism by which information during saccades was suppressed. However, more recent experiments indicate that a different mechanism explains at least part of the suppression and perhaps all of it (Matin, 1974). It can be demonstrated that under certain unnatural circumstances visual input during the saccade can be perceived (Uttal & Smith, 1968): When the room is totally dark prior to and after the saccade and a pattern is presented only during the saccade, a smeared image of the pattern or scene is perceived (Campbell & Wurtz, 1978). Since the blur is seen if no visual stimulation precedes or follows it, the implication is that the information available prior to and after the saccade during normal vision masks the perception of any information acquired during the saccade. In sum, while it cannot be concluded for sure that absolutely no visual information is extracted during saccades in scene perception, the bulk of the evidence indicates that if visual information gets in during a saccade, it is of little practical importance.

THE SPAN OF EFFECTIVE VISION

Earlier, we raised the question of how much useful information is obtained during each eye fixation on a scene. We asserted that it was plausible that a global apprehension of the "gist" of a scene might be extracted on a single fixation, but that object identification would be limited by the ability to extract detail in the parafovea and periphery. The data support this commonsense view and also tend to suggest that object identification may need covert attention to succeed. Before jumping ahead to scene perception, however, we will briefly review the data on the span of effective vision in reading, which is much better understood.

The span of effective vision in reading has been primarily studied using the moving window technique (McConkie & Rayner, 1975) or some variant of it (Balota, Pollatsek & Rayner, 1985; Rayner, 1975). In the moving window paradigm, the reader sees a normal region of text around fixation (the "window"), but the text outside the window is mutilated in some fashion (a simple sketch of this display logic is given below). The basic finding (see Rayner & Pollatsek, 1987, for a complete review) is that, in English, the window of text is small, extending no more than 15 characters horizontally from fixation.1 Moreover, there is no evidence that useful information is extracted either above or below the line of text being read. The horizontal limit of the window stated above appears to be determined by visual acuity considerations, since useful information is not extracted beyond 5 degrees even if there is an isolated word in the parafovea (Rayner, McConkie & Ehrlich, 1978) or if there is no useful information in the fovea (Rayner & Bertera, 1979). The window appears to be somewhat smaller in some other languages (Hebrew, Chinese, and Japanese) in terms of the number of characters that can be processed; however, it is not clear whether this difference is due to visual characteristics of these orthographies or to more central factors such as the fact that the same morphemic information is represented more compactly in these languages.

1 In reading, the appropriate metric is saccade length rather than visual angle (Morrison & Rayner, 1981). For most normal sized text, three or four characters subtend one degree of visual angle.
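As a concrete illustration of the moving-window display logic sketched above, the following toy function renders a single line of text with a window of visible characters centred on the fixated character and the remaining letters replaced by a masking character. It is our own simplified reconstruction, not the display code used in the studies cited; real implementations are gaze-contingent, redrawing the window on every fixation.

    # Toy moving-window display for one line of text: characters within
    # half_window positions of the fixated character are visible; all other
    # letters are masked with 'x' (here spaces are left intact, one of
    # several masking choices used in the literature).

    def moving_window(text, fixation_index, half_window=15):
        display = []
        for i, ch in enumerate(text):
            if abs(i - fixation_index) <= half_window or ch == ' ':
                display.append(ch)
            else:
                display.append('x')
        return ''.join(display)

    line = "the quick brown fox jumps over the lazy dog"
    print(moving_window(line, fixation_index=10, half_window=6))
    # -> xxx quick brown fxx xxxxx xxxx xxx xxxx xxx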


The fact that useful information is apparently not extracted above or below the line of text suggests, however, that attentional factors also play a part in the span of effective vision. In addition, several studies demonstrated that the span of effective vision is asymmetric; no useful information is extracted to the left of the fixated word in English (Rayner, Well & Pollatsek, 1980). This indicates that the reader attends only to the fixated word and to the information to the right of it that is about to be fixated. The attentional nature of this asymmetry is supported by the finding that the window is asymmetric in the opposite direction for either Hebrew (which goes from right-to-left) or for English text in which the words go from right-to-left (Pollatsek, Bolozky, Well, & Rayner, 1981; Inhoff, Pollatsek, Posner & Rayner, 1989). Another interesting finding is that the size of the window appears to vary with the difficulty of the task. If the foveal information is difficult to process either because it is visually complex (Inhoff et al., 1989) or semantically difficult (Henderson & Ferreira, 1990; Rayner, 1986), the span of effective vision gets significantly smaller in reading.

We think that the picture for the span of effective vision in scene perception mirrors that for reading with one important difference: Meaningful information can be extracted much further from fixation in pictures than in text. However, the data clearly indicate that there are significant limits to how far from fixation information can be extracted even in scene viewing. Two lines of research support this conclusion.

First, Nelson and Loftus (1980) allowed subjects to look at a scene for a limited amount of time while their eye movements were recorded. They then determined how close to fixation an object had to be for it to be recognized as being in the scene. They found that objects located within about 2.6 degrees from the nearest fixation point were generally recognized, but it depended to some extent on the characteristics of the object. Their data also indicated that objects that were never fixated within 2.6 degrees were recognized little better than chance. Similar results were reported by Nodine, Carmody, and Herman (1979).

The second line of research utilized the moving window paradigm discussed above. As in the reading studies discussed above, subjects are free to move their eyes wherever they wish. However, the amount of information available for processing on each fixation is controlled by the experimenter. Saida and Ikeda (1979) varied the exposure duration of a scene and the window size in a study phase; the probability of correctly recognizing a picture in a later memory test was determined as a function of these two variables. In addition, two picture sizes were used (14.4 x 18.8 degrees and 10.2 x 13.3 degrees). As would be expected, and consistent with reports by Loftus (1972) and Nelson and Loftus (1980), recognition accuracy increased with viewing duration. More importantly, Saida and Ikeda found a serious deterioration in the recognition of pictures when the window was limited to a small area around the fovea (about 3.3 x 3.3 degrees) on each fixation. Performance gradually improved as window size became larger, reaching asymptote when it was about 50% of the entire pattern size.


Saida and Ikeda made the point that the scene is scanned so that there is considerable overlap of information from fixation to fixation.

These studies thus indicate that while some information is extracted from the parafovea and periphery that is used in later memory tasks, objects are typically not identified (or, at least, remembered) when no fixation is near them. This non-recognition could be due to acuity factors, attentional factors, or both; at present no data resolve this issue decisively. However, there are data that suggest that attentional factors are important in object perception as well as in word perception. An experiment by Henderson, Pollatsek and Rayner (1989) used the moving window technique to get at this issue. Four line drawings of objects were placed in the corners of an imaginary square, and subjects were asked to inspect them in order to answer an immediate memory test about the objects. Processing time on an object (measured by fixation time2) was facilitated if a "preview" of that object was available in the visual display just before the object was about to be fixated. Otherwise, extrafoveal information about that object was irrelevant. Thus, the story that emerges from Henderson et al.'s study is the same as in reading: useful information about objects is extracted only when they are fixated or about to be fixated. It is hazardous, of course, to generalize from these experiments to scene perception; nevertheless, the results are intriguing. We are currently exploring generalizing this paradigm to scene perception.

Most of the experiments we have reviewed suggest that there are real limits (both due to acuity and attentional factors) to processing objects that are not fixated. The results of Saida and Ikeda indicate, however, that some useful information is extracted quite far from fixation, although their experiments do not allow an assessment of what information is extracted in the periphery. Some research on scene perception, however, indicates that global information about the scene can be extracted quite quickly from an image of a scene. We will discuss later the issue of whether this global information has an effect on object recognition. For now, we simply want to establish that this information can be extracted quite rapidly.

In one paradigm, Biederman, Mezzanotte and Rabinowitz (1982; see also Biederman, 1972) exposed images of scenes briefly (for 150 ms) followed by a pattern mask. Prior to viewing the scene, the subject saw a word indicating a target object (e.g. "sofa") and when the pattern mask appeared, the subjects were cued to a spatial location. They judged whether the object in that location had been a sofa or not.

2 A number of different measures of processing time can be inferred from fixation time on an object: First fixation duration refers to the duration of the first fixation on an object; gaze duration refers to the sum of the fixations on an object prior to an eye movement to another object; and total time on the object includes the sum of all fixations (including regressions) on an object. The most frequently used measures are first fixation and gaze duration.
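The measures defined in footnote 2 can be computed mechanically from a scored fixation record. The sketch below derives first fixation duration, gaze duration, and total time per object from a sequence of (object, duration) pairs; the data format and function name are illustrative assumptions, not taken from the studies cited.

    # Computing the fixation-time measures of footnote 2 from a fixation
    # sequence given as (object_id, duration_ms) pairs.

    def fixation_measures(fixations):
        """Per object: duration of the very first fixation, gaze duration
        (sum of the first run of consecutive fixations), and total time
        (all fixations, including later returns)."""
        first_fix, gaze, total = {}, {}, {}
        left_once = set()   # objects whose first run of fixations has ended
        prev_obj = None
        for obj, dur in fixations:
            total[obj] = total.get(obj, 0) + dur
            if obj not in first_fix:
                first_fix[obj] = dur
                gaze[obj] = dur
            elif obj == prev_obj and obj not in left_once:
                gaze[obj] += dur
            if prev_obj is not None and prev_obj != obj:
                left_once.add(prev_obj)
            prev_obj = obj
        return first_fix, gaze, total

    seq = [("sofa", 200), ("sofa", 150), ("lamp", 300), ("sofa", 100)]
    ff, gz, tt = fixation_measures(seq)
    # ff["sofa"] == 200, gz["sofa"] == 350, tt["sofa"] == 450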


We will get to the details of this study later, but for the moment, the result of importance is that the ability to identify the target object was affected by whether the object belonged in the scene or not. Moreover, the responses were generally made within a second. While there is some issue of whether the influence of the scene on object identification in this task is perceptual or post-perceptual, the fact remains that the relationship between the scene and the object is having some effect within a second or so. This indicates that the gist of the scene can be extracted quite rapidly in a single fixation.

Experiments by Boyce, Pollatsek and Rayner (1989) confirmed this finding. Moreover, Boyce et al. employed scenes in which there was no useful information near the fovea and in which it would be virtually impossible to identify the scene from any significant local detail. Thus, it appears that relatively global, extrafoveal information is crucial for the identification of the "scene" (e.g. that the background is a "street scene"). Moreover, follow-up experiments by Boyce and Pollatsek (1992a, 1992b) indicated that the ability to extract such global information rapidly is not restricted to tachistoscopic presentations. We will leave the details for a succeeding section, but the important result is that the meaning of a scene had some effect on the time to name an object either on the first or second fixation on the scene (i.e., within a second of the global scene information being available). Thus, it appears that the ability to extract global extrafoveal information to identify a scene is not limited to situations involving tachistoscopic presentations.

HOW IMPORTANT ARE EYE MOVEMENTS IN SCENE PERCEPTION?

As we argued at the beginning of the article, subjects make many eye movements on a scene when examining it for the purpose of remembering it or searching for an object in the scene. This suggests that fixating an object or detail in a scene is necessary, or at least beneficial, for its identification. However, it has often been implied that examining the pattern of eye movements during scene perception is a high-cost, low-yield endeavor, since the essence of scene perception (including the identification of the component objects) occurs on a single fixation. This view stems largely from a particular interpretation of experiments employing tachistoscopic presentation. In some experiments (e.g., Potter, 1976) many scenes are presented rapidly (about 200 ms each) and subjects presumably can perceive each of these scenes. However, such experiments tend to use stimuli similar to magazine photographs. These stimuli are typically composed in a way that the crucial information for recognizing the scene is in or near the centre of the scene (often a key object). Thus, it is not clear that correct identification of the scene implies that much or any parafoveal or peripheral information has been processed.

The experiments of Biederman et al. (1982), however, indicate that significant information about objects not in the fovea can be extracted.


The finding—that target objects (at some distance from fixation) can be identified from a single tachistoscopic presentation with about 70-90% accuracy—has been taken to suggest that the bulk of object identification takes place on the first fixation on a scene. Such a conclusion, however, goes beyond the data. Remember that the task in such experiments is to respond whether or not the object in the location is a member of a prespecified object category (e.g. "sofa"). Achieving performance well above chance in this forced-choice task does not necessarily indicate that the object has been fully identified; it only indicates that sufficient information has been extracted to discriminate a sofa reasonably reliably from some "false alarm" object category (which was probably in general fairly physically discriminable from the target category). Thus, as stated earlier, these studies indicate that significant information about objects in the parafovea and periphery is extracted, not necessarily that full identification is accomplished.

A second line of evidence that significant object or detail information is extracted from a single fixation is that subjects often fixate an important or unusual object early in scene viewing (Friedman, 1979; Loftus & Mackworth, 1978). The suggestion is that significant semantic information must have been extracted prior to fixation to indicate that the object was important or unusual. It certainly seems well documented that "informative" regions tend to be fixated instead of uninformative ones (Antes, 1974; Friedman & Liebelt, 1981; Mackworth & Morandi, 1967); the issue is what the substrate of informativeness is. Certainly in the Mackworth and Morandi study, low-level information, such as contours and regions of unusual brightness, seemed to be the primary determinant of informativeness. In some studies, such as the Loftus and Mackworth (1978) study, the intention was to make the object unusual on a semantic basis. However, inspection of the examples suggests that these objects may have been atypical visually as well (e.g., the octopus in a farm scene was the only small object with lots of wavy lines).

To summarize, significant useful information can be extracted about objects extrafoveally; however, the data are far from clear on how often complete identification is accomplished on the basis of this information. An experiment by Parker (1978), in fact, suggests that full identification usually needs a fixation. In his experiment, subjects memorized a target scene (e.g. a classroom scene with significant objects in it such as a teacher, several students, a blackboard, a map on the wall). They then were shown a series of comparison scenes and asked to judge whether they were identical to the target scene. Many of the discriminations asked for were subtle (such as a change from one male student to a different male student), but some were not so subtle (such as a change from a male student to a female student). The chief finding was that subjects needed to fixate the object that was different in order to make a "different" judgement; however, they did not need to fixate a region where an object in the target scene was missing from the comparison scene in order to make a "different" judgement.


Parker's data thus suggest that fixating an object may be necessary for reliable identification.

Let us attempt a plausible synthesis of the above data. Significant information about objects and detail can be extracted from the parafovea, if not the periphery. Biederman et al. (1982) showed that forced-choice decisions on parafoveal objects in scenes can be made with reasonable accuracy. We (Pollatsek, Rayner & Collins, 1984) also showed that isolated objects 2 degrees wide could be identified 95% of the time if they were 5 degrees from fixation, and 85% of the time if they were 10 degrees from fixation. The subjects' task was to decide which of 20 objects the line drawing represented. Thus, the viewer has some chance of identifying a particular object on the basis of extrafoveal information, especially if it is the only object in the display and/or if attention is drawn to it. However, both the observation that subjects typically make many fixations during natural scene viewing and Parker's (1978) finding that subjects invariably fixated an object to determine whether it was different from a target object indicate that viewers fixate objects to be certain about what they are. Whether the majority of such fixations are absolutely necessary for identification may be a moot point, since it is so easy to fixate an object of interest.3

We thus think that the naive picture of scene perception introduced at the beginning of the article is, in outline, correct. Much of the global information about the scene background or setting is extracted on the initial fixation. Some information about objects or details throughout the scene can be extracted far from fixation. However, if identification of an object is important, it is usually fixated. The work discussed in the next section indicates that this foveal identification is aided significantly by the information extracted extrafoveally. This, of course, is just a rough sketch and it leaves many questions unanswered. One area we are completely omitting is how the geometric representation of the scene is built up, and another is how the global meaning of a scene is understood (e.g. the "point" of a cartoon, see Carroll, Young & Guertin, 1992).

Another functional use of fixations in scene viewing is that they may be instrumental in aiding memory of a scene. One fairly interesting and controversial claim regarding the importance of eye fixations during scene perception was advocated by Noton and Stark (1971a, 1971b; see also Stark & Ellis, 1981) in their "scanpath" theory of perception. According to Noton and Stark, the process of pattern perception is viewed as a serial process with a fixed-order strategy of extraction of information from the scene both during an initial viewing period and during subsequent recognition of the scene.

3 It is interesting in this regard that Parker found that a changed object was fixated earlier than an unchanged object, indicating that subjects suspected where the changed object was before they fixated it.


Although the sequence of fixations will be largely unpredictable before viewing begins, Noton and Stark argued that part of the memory for a picture comprises information about the sequence actually used to get the information in the first place. According to their theory, the order of fixations on a scene during initial viewing and later recognition should be similar, and they presented some data which were consistent with this idea. However, other studies (see Furst, 1971; Locher & Nodine, 1974; Parker, 1978; Walker-Smith, Gale, & Findlay, 1977) have demonstrated that while the initial viewing and subsequent recognition patterns of fixations may be similar at times, there is no necessity that they be so for accurate recognition to occur (as the strong form of the Noton and Stark theory proposed).

Experiments by Loftus (1972) dealing with scene perception and later memory processes have determined that memory for a scene is related to the number of fixations made on the scene: More fixations yielded higher recognition scores. This general finding, along with the fact that there is quite a bit of variation in the pattern of eye fixations on a scene as a function of instructions (Buswell, 1935; Yarbus, 1967), suggests that people continue to extract important information from the scene following the initial few fixations. Whether these fixations are strictly necessary for visual information extraction or are simply aids for highlighting certain information for later memory retrieval is an open question.

The above suggests that there may be different patterns of eye fixations early and late in viewing, reflecting strategic differences. Antes (1974) reported a pattern of visual exploration whereby subjects initially made many long saccades to fixate on informative parts of the scene, with short fixations. This behaviour gradually evolved into fixating informative areas less frequently and examining less informative details via short saccades and long fixations. On the other hand, Nodine, Carmody, and Kundel (1978) found no evidence that mean fixation duration or saccade length varied over the course of viewing. Fixation durations were longer throughout, though, on areas rated high in informativeness.

INTEGRATION OF INFORMATION ACROSS SACCADES

One fascinating aspect of visual perception is that we perceive a stable, coherent view of the scene we are observing despite the fact that the periods of visual input are separated by eye movements. Somehow the brain is able to smooth out the discrete inputs from each eye fixation. However, it is not clear how information present on fixation n is integrated with the information from fixation n+1 to produce this perception of a stable world. A major challenge in visual perception is to specify exactly what types of information are integrated and to construct an adequate model of the mechanisms of integration.


A model of information integration that has been frequently advocated (Breitmeyer, 1983; Feldman, 1985; Jonides, Irwin, & Yantis, 1982; McConkie & Rayner, 1976) is that information is integrated in a visual buffer that combines the information available on successive fixations. McConkie and Rayner (1976) initially referred to this as an integrative visual buffer, and various other terms have been used to describe the same type of mechanism. According to the buffer model, low-level featural information (of the type postulated in Marr's, 1982, "primal visual sketch") obtained on fixation n from parafoveal and peripheral vision is stored in the buffer and combined with visual information obtained from the same spatial location on fixation n+1. Thus, subjects presumably keep track of how far they move their eyes and use this information to pool the visual information from the two fixations.

While the buffer model has considerable intuitive appeal, most of the research has indicated that integration works in a different manner. The clearest negative evidence comes from research on how integration affects object identification. More specifically, the research asks how information extracted from a parafoveal view of an object is used to enhance identification of the object when it is subsequently fixated. We will term this phenomenon preview benefit. Much of the original work involved words as stimuli (see Pollatsek & Rayner, 1992; Rayner & Pollatsek, 1989, for summaries). While words are admittedly unusual visual stimuli, the findings of the research have generalized to a surprising extent to research with objects, so we will quickly summarize these findings.

With words, two paradigms have been used to measure preview benefit. In one (see Rayner et al., 1978), an isolated word is shown in the parafovea, which the subject subsequently fixates and names; benefit is measured by a decrease in naming latency over when an unrelated word or string of Xs is displayed in the parafovea. In the second paradigm (see Rayner, Well, Pollatsek & Bertera, 1982), subjects silently read text while their eye movements are monitored. Of crucial interest is the decrease in fixation time on a word when it has been previewed in the parafovea over when an unrelated word or string of Xs appeared in the parafovea. In both cases, the word is processed about 30-50 ms faster when it is previewed than when it isn't (the size of the effect depends on the distance of the parafoveal string from fixation).
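In both paradigms the benefit itself is simply a difference between condition means. The toy calculation below makes the measurement explicit; the latencies are invented for illustration and were merely chosen to fall in the range just cited.

    # Preview benefit = mean processing time with an uninformative preview
    # minus mean processing time with an identical preview.
    # All latencies below are invented, purely for illustration.

    naming_ms_identical = [512, 498, 530, 505]  # preview was the target word
    naming_ms_unrelated = [548, 541, 569, 552]  # preview was unrelated (or Xs)

    def mean(values):
        return sum(values) / len(values)

    benefit_ms = mean(naming_ms_unrelated) - mean(naming_ms_identical)
    print(round(benefit_ms))  # -> 41, within the 30-50 ms range cited above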


Of greater interest is whether benefit can be obtained if the preview string is not identical to the fixated string; this allows one to determine the kinds of codes involved in the process of integration. Perhaps the most striking finding is that preview benefit is unaffected by physical changes that preserve the lexical identity of the string. That is, preview benefit is unchanged if the case of the letters of the preview is different from that of the target (McConkie & Zola, 1979; Rayner, McConkie & Zola, 1980) and if the preview is not in the same physical location as the target (Rayner et al., 1978). This indicates that, for words at least, integration of information is not subserved by something like an integrative visual buffer: if it were, the mismatch of featural information produced by a case change should have produced substantial interference.

Other research indicated that letter strings in the parafovea that were similar to the target word produced partial preview benefit. For example, when the preview string and target string overlapped by 2 or 3 letters in the initial positions (e.g. CHART is the preview and CHEST is the target), there was sizeable (though not full) preview benefit (Rayner et al., 1980; Rayner et al., 1982). In addition, subsequent research (Pollatsek, Lesch, Morris & Rayner, 1992) has indicated that phonological information is used in the integration process. These findings are consistent with a model which posits that a "neighbourhood" of items in the lexicon is excited by the preview, with entries that are closest to the actual stimuli being excited the most strongly. Thus, a preview of CHART excites the lexical entry for CHEST, though not as strongly as a preview of CHEST itself. The involvement of phonological cues indicates that both a visual lexicon and an auditory lexicon are involved.
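A minimal sketch of this neighbourhood idea: a preview excites every entry in a small lexicon in proportion to its similarity to that entry, so an identical preview produces maximal excitation and an overlapping preview such as CHART for CHEST produces partial excitation. The similarity rule (proportion of letters matching by position) and the toy lexicon are our own simplifications, not a fitted model of the data.

    # Toy neighbourhood-activation sketch: a preview excites each lexical
    # entry in proportion to position-by-position letter overlap. The
    # similarity rule is a deliberate oversimplification.

    LEXICON = ["CHEST", "CHART", "CHASE", "TABLE"]

    def similarity(a, b):
        """Proportion of positions at which the two strings match."""
        return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

    def excitation(preview):
        return {entry: similarity(preview, entry) for entry in LEXICON}

    print(excitation("CHART"))
    # -> {'CHEST': 0.6, 'CHART': 1.0, 'CHASE': 0.6, 'TABLE': 0.0}
    # A CHART preview partially excites CHEST (shared C, H, and final T),
    # which is one way to picture partial (rather than full) preview benefit.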


Similar work was conducted with line drawings of objects (Pollatsek et al., 1984; Henderson, Pollatsek & Rayner, 1987; Pollatsek, Rayner & Henderson, 1990). The findings were quite similar to those from the studies with words. First, a size change of 10% produced no decrement in preview benefit (Pollatsek et al., 1984), and there was little or no decrement in preview benefit when the location of the preview and target changed (Pollatsek et al., 1990), indicating that a direct point-to-point mapping was irrelevant for preview benefit. Second, there was some benefit even when the preview and target only shared the same name (e.g. a baseball BAT and the animal BAT), indicating that phonological codes may also be involved in the integration process for objects. Third, there was partial preview benefit when the preview was visually similar to the target (e.g., a CARROT was a preview for a baseball BAT). However, there was a decrement in preview benefit when the preview and target differed in form but not in meaning. For example, there was some decrement when one line drawing of a horse was a preview for a line drawing of a different horse. In addition, there was also some decrement in preview benefit when the preview was a mirror image of the target.

The conclusion that emerges from these studies is that integration of objects is not subserved by a spatiotopically organized feature array, as posited by the integrative visual buffer, since an identity of spatial location between preview and target is unimportant for integration; likewise, identity of form is irrelevant for words and identity of size is irrelevant for objects. Instead, it appears that the integration is more abstract, perhaps being mediated by a lexical representation in reading and something like a lexical representation in object detection. In the case of objects, however, the "lexicon" is less abstract than with words, where the lexical entries appear to be defined by a sequence of "abstract" (i.e., case-independent) letters. For objects, the entries in the lexicon appear to be defined by something like their form, so that one version of a horse excites the entry of another version of a horse, but not fully. Similarly, a carrot in the parafovea excites a bat, but not as fully as would a preview of a bat. The "lexicon" we are proposing for objects is clearly in danger of being unworkably large, since it appears to contain entries for different kinds of horses and perhaps different entries for objects in differing views. However, it is not necessary for it to contain all possible representations, just a set dense enough that an entry is "near enough" to any recognizable stimulus to be excited strongly. Moreover, there is physiological evidence (Gross & Mishkin, 1977) that indicates that there may indeed be orientation-specific object detectors (at least for important stimuli such as faces).

The research we have reviewed does not exhaust the topic of information integration, since it merely deals with object identification. Perhaps the information for objects is integrated in a relatively abstract way, but other scene information is integrated in a more "visual" way. There have been a series of experiments that have attempted to get at this issue by investigating either whether "pieces" of objects can be integrated across fixations or whether lower-level featural information can be. Our opinion, however, is that the bulk of the evidence is negative.

Jonides, Irwin, and Yantis (1982) reported some results that were initially taken as evidence for a low-level integrative visual buffer. Subjects were asked to fixate on a target in the centre of vision while 12 dots appeared in parafoveal vision, filling half the cells of an imaginary 5 x 5 matrix. Subjects were instructed to move their eyes to the array and when they did so, the 12 initially presented dots were replaced by 12 new dots filling another 12 cells of the matrix. The subjects' task was to indicate which dot from the 5 x 5 array was missing. High performance in this task could only be achieved if subjects were holding the first 12 dots in a spatiotopic buffer and were integrating that information with the second 12 dots available following the saccade. Jonides et al. found that subjects were able to report the location of the missing dot with an extremely high degree of accuracy, whereas they were about at chance in a control condition which mimicked the retinal events in the above condition but in which there was a change in spatial location between the two arrays. (Subjects held fixation while the first 12 dots were presented parafoveally and then the second 12 dots were presented foveally.) These results thus seem like good evidence for an integrative low-level visual buffer.
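The logic of the task is worth spelling out, because it shows why high accuracy was taken as evidence for spatiotopic integration: if the two 12-dot arrays were fused in a single spatiotopic buffer, 24 of the 25 cells would effectively be filled and the observer could simply read off the empty one. The sketch below states that integration account as a set operation over grid cells; it models the task logic only, under our own encoding assumptions, not the visual system.

    # The integration account of the missing-dot task as a set operation:
    # if the two 12-dot arrays are fused spatiotopically, the missing cell
    # of the 5 x 5 matrix is the single cell that neither array filled.

    GRID = {(row, col) for row in range(5) for col in range(5)}  # 25 cells

    def missing_cell(first_array, second_array):
        """Each argument is a set of (row, col) cells; jointly the two
        arrays must fill 24 of the 25 cells of the matrix."""
        leftover = GRID - (first_array | second_array)
        assert len(leftover) == 1
        return leftover.pop()

    # Example: split the 24 filled cells into two 12-dot arrays, leaving
    # cell (2, 2) empty.
    first = {(r, c) for r in range(5) for c in range(5)
             if (r * 5 + c) % 2 == 0} - {(2, 2)}
    second = {(r, c) for r in range(5) for c in range(5)
              if (r * 5 + c) % 2 == 1}
    print(missing_cell(first, second))  # -> (2, 2)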


Unfortunately, it soon became apparent that the effect reported by Jonides et al. (and a similar one reported by Breitmeyer, Kropfl & Julesz, 1982) was due to an artifact of the procedure they used. A number of studies (Bridgeman & Mayer, 1983; Irwin, Yantis, & Jonides, 1983; Jonides, Irwin, & Yantis, 1983; O'Regan & Levy-Schoen, 1983; Rayner & Pollatsek, 1983) demonstrated that the original results were due to screen persistence in the cathode ray tube (CRT) used to present the stimuli. When appropriate controls were used, the integration effect became nonexistent.

Other studies (Davidson, Fox & Dick, 1973; Ritter, 1976; Wolf, Hauske, & Lupp, 1978, 1980) dealing with low-level visual processing reported findings consistent with the integrative visual buffer theory. For example, Wolf et al. appeared to demonstrate that information about spatial frequency was retained across saccades. Subjects saw a grating in the parafovea which apparently lowered their thresholds to gratings of the same frequency when seen foveally. However, this finding could not be replicated when phosphor persistence was eliminated as a possible explanation (Irwin, Zacks & Brown, 1990). Similarly, Davidson et al.'s widely cited report that information is spatially aligned across fixations did not replicate in a more closely controlled study (Irwin, Brown & Sun, 1988). Other experiments by Irwin and colleagues (Irwin, 1991; Sun & Irwin, 1987) have likewise failed to find evidence for spatiotopic informational persistence (see Irwin, 1992, for a review; see also Van der Heijden, Bridgeman & Mewhort, 1986, for arguments against spatiotopic informational persistence).

There thus appears to be no evidence for an integrative visual buffer. In fact, we think that a good case can be made that, upon reflection, it would be of little use in the above tasks. First of all, for the integration process to be of much value, the two images would have to be aligned fairly precisely. One way this could be accomplished is if there were a precise record of how far the eye moved; however, eye guidance is not particularly precise (eye movements usually undershoot the target and there is also random variability). Another way that the alignment could be achieved is if there were a preliminary analysis of each image to determine which parts of the images corresponded. Such an analysis, however, would be non-trivial and likely to be more difficult than forgetting the visual information from the first fixation and starting from scratch on the second fixation to extract whatever information is needed from the visual world. Even if the alignment problem were removed, there would still be a problem in integrating fuzzy featural information from the parafovea and periphery with clear featural information from the fovea. Moreover, it is likely that there are systematic distortions coming from the optics of the visual system that would produce differences between the parafoveal and foveal images. Thus, it makes a lot of sense for the visual system to wipe away the featural information from the prior fixation and start with a fresh slate on the new fixation.

Since you don't want amnesia, however, some information needs to be preserved. If the goal is object detection, then what would be most beneficial is information about what the object is likely to be. Our hypothesis is that excitation in a neighbourhood in a lexicon for objects is one way of providing such information. An alternative proposal, using a distributed representation metaphor, would be a pattern of excitation in a set of "hidden units" that subserves pattern recognition.


units" that subserves pattern recognition. In either case, the "visual information" is preserved by the pattern of excitation; that is, if the set of objects excited in the object lexicon arc a baseball bat, carrot, screwdriver, etc., one has preserved the information that the object is long and thin. While object identification is an important part of visual perception, it is not the whole process. In addition to identifying objects, we need to understand the visual world as an array of objects located in space, and the relationships of the objects (e.g. object X is resting on object Y). Can it be that no visual information needs to be preserved from fixation to fixation lo help achieve this representation? For example, how does one know where an object is once the visual information is gone? Thus, it seems plausible that some sort of visual representation survives a saccade. There is some evidence, however, that this information may not be very precise. For example, a scene can be moved during a saccade up to about one-third of the distance of the saccade without the subject noticing the display change (Bridgcman, Hcndry & Stark, 1975; Bridgeman & Stark, 1979). Similarly, the 10% size changes we employed in our experiments were rarely noticed. Larger changes, however, are noticed; for example, we informally noticed that object size increases of 100% across fixations were often perceived as "looming". This may indicate that some relatively raw "spatial information" is preserved in a relatively low resolution system, possibly the collicular system. Two recent experiments, however, have suggested that more precise information is used in the integration process. In one, Hayhoe, Lachtcr and Feldman (1991; see also Hayhoe, Lachter & Moeller, 1992) reported that subjects can integrate the locations of dots seen on three successive fixations lo determine whether the angle formed by the dots is a right angle or not. In another. Palmer and Ames (1989) reported that subjects can preserve information about the length of a line across fixations. While these reports arc suggestive, in either case, one could argue that a "non-visual" form of coding was used to perform the task. In Hayhoe et al.'s experiment, subjects may have used the pattern of eye movements to do the task; in addition, the fact that one subject was virtually unable to do the task suggests that the task may not have been tapping some fundamental property of the visual system. In Palmer and Ames's task, subjects might have converted line lengths to an abstract code such as a number; since there were a relatively small number of lengths employed in the task, this may have been a relatively efficient procedure. In summary, the bulk of the evidence suggests that for scene perception where meaningful information is presented, integration of information occurs at an abstract level of representation rather than at a visual level. SCENE CONTEXT, OBJECT IDENTIFICATION AND F.YH MOVEMENTS

As we mentioned in the introduction, the question of how the eyes move in viewing scenes is ill-defined, since there is no well-defined task of scene viewing.


A sub-question that has been pursued, however, is how scene context affects the perception of objects and how this interaction influences the pattern of eye movements. The interest in eye movements has two components. The first is that fixation time on an object is a candidate dependent variable for measuring the time to identify that object. The second is that the eye movement record is of intrinsic interest in understanding the process of scene perception.

Basically, two tasks have been employed for studying the effects of context on object identification. The first is a memory task; subjects are allowed to study a picture for a later recognition memory test. In this task, object identification has been measured primarily by the duration of fixations on objects in the scene. The second is some variation of a visual search or identification task; the subject either searches the visual display for a given object or objects or is told (in some fashion) to identify a given target object in the display. In these tasks, object identification has been measured in several ways: fixation time on an object, the probability of correctly identifying an object, and/or some sort of manual or vocal response time measure.

The data from the recognition memory experiments indicate that fixation time on an object is indeed influenced by whether the object is congruent with the scene context (Antes & Penland, 1981; Friedman, 1979; Loftus & Mackworth, 1978). In the standard experiment, fixation time on an object (e.g. a sofa) when it belongs in a scene (such as a scene depicting a living room) is less than the fixation time on that object when it doesn't belong in the scene (e.g. a sofa in a street scene). Following Biederman et al. (1982), we will refer to the former instance as a "normal" object and the latter as an object "in violation" of some property of scene coherence. This result clearly demonstrates that eye movements reflect on-line processing of the scene. One cannot be certain, however, whether the longer fixation times on the objects in violation of the scene reflect longer times to identify those objects, or longer times to integrate them into a global representation of the scene, or something else (e.g., the longer fixation times on the objects in violation might register amusement at the absurdity of the object in that context). This concern is buttressed by two observations. The first is that the experiments employ a memory task; thus the subject is likely to be doing some sort of "memorization" and/or covert rehearsal on many fixations. The second is that fixation times on objects in these experiments are usually quite long (first fixations on objects are on average 400 ms); thus, it is plausible that much of the time spent on objects reflects processes other than object identification.

The focus of most of the other experiments on scene contexts has been to demonstrate that context operates at the level of object identification. Since we have no task that magically taps only object identification, there is still some controversy about whether such an effect has in fact been demonstrated; however, we think that the weight of the evidence points to the conclusion that scene context can modulate the process of object identification.

Eye Movements and Scene Perception

361

Since we have no task that magically taps only object identification, there is still some controversy about whether such an effect has in fact been demonstrated; however, we think that the weight of the evidence points to the conclusion that scene context can modulate the process of object identification.

The memory experiments described above allowed subjects to freely examine scenes. At the other extreme of control is the paradigm in which the scene is presented for a short time in order to preclude eye movements. In these experiments, as mentioned earlier, the scene is flashed briefly, followed by a spatial marker. The subject is asked to judge whether a prespecified target object was present at this location. The central finding is that the target object is more accurately identified when it belongs in the scene than when it is in violation (Biederman et al., 1982; Boyce et al., 1989). One concern with this paradigm is that "identification" of the target object may reflect a problem solving process, analogous to crossword puzzle solution, involving perceived fragments of the target object and the scene context rather than normal perceptual identification. Since the process studied is reasonably fluent (response times in Biederman et al. were on the order of a second), we think it is unlikely that conscious problem solving frequently occurs in this task. However, it is still a concern that identification of objects in this paradigm may not reflect the same process as in many situations where people view objects in scenes. That is, in most tasks, if identifying an object becomes important to a viewer, the easiest thing to do is to fixate the object. Indeed, in most situations where there is free viewing, subjects tend to fixate informative regions of the scene. Thus, since it is likely that most objects of interest are fixated during "normal" scene viewing, it seems of interest to investigate the issue of context when an object is fixated.

Two recent paradigms have been developed to study foveal object identification in scene perception. In the first, the "wiggle" paradigm (Boyce & Pollatsek, 1992a, 1992b), the subject fixates in the middle of the scene, and 75 ms later the target object "wiggles" (moves a fraction of a degree of visual angle and returns to its original position). The subject's task is to name the wiggled object. In fact, subjects invariably move their eyes to fixate this object before naming it. The central finding is the same as that from the brief presentation method above: the target object was named more rapidly when it was in a normal scene context than when it was in the wrong scene (Boyce & Pollatsek, 1992b). Both the brief presentation paradigm and the wiggle paradigm indicate that context can have an immediate effect on the processing of objects. Moreover, since the subject's task in both paradigms is to identify the target object rather than to memorize a scene, it seems plausible that the influence of context observed is on the process of object identification. It is possible, however, given that the identification times for objects in both paradigms are relatively long (roughly a second), that the context effect observed may be on stages beyond the identification stage.

Another paradigm, used by De Graef, Christiaens and d'Ydewalle (1990; see also De Graef, 1992) to investigate scene context effects, is a variation of the "object decision" task (Kroll & Potter, 1984). In the task used by De Graef et al., subjects examined a scene and responded whenever they detected a "non-object" in the scene. A non-object is a line drawing of a three-dimensional geometric entity that does not correspond to any object they have seen. Of central interest in the study was not the processing of non-objects, but the processing of the objects. In particular, as in the recognition memory experiments, the key question is whether objects that belong in a scene are fixated for a shorter period of time than objects that don't belong. De Graef et al.'s findings are somewhat complex, but the main point is that scene context does affect the fixation time on an object, but only after the subject has been viewing the scene for a while (roughly 10 fixations). Thus, De Graef et al. also found that scene context can affect fixation times on objects (and presumably the identification time for an object), but only after extended viewing of a scene. The latter aspect of their finding contradicts the results obtained with both the brief presentation method and the wiggle paradigm, where the effects of scene context were quite immediate.

Space precludes a complete discussion of this issue (see Boyce & Pollatsek, 1992a, for a more complete analysis). However, what we think it indicates is that a scene context can have an immediate effect on object processing, but it need not; in other words, processing of the scene background can be rapid, but under certain circumstances subjects may either ignore it or process it more slowly. For example, employing a variant of the brief presentation method, Malcus, Klatsky, Bennett, Genarelli and Biederman (1983) demonstrated that subjects ignored the background (i.e. showed no effect of the coherence of the background) when instructed to ignore it. Similarly, the non-object detection task apparently induced subjects either to ignore the background or to process it locally during their early viewing of the scene.⁴

The other explanation of the discrepancy between the De Graef et al. result and the others is that the fixation measure in their task uniquely captures object identification and the context effects in all other tasks are contaminated by post-identification effects. While this is possible, we feel it is unlikely. For example, Henderson et al. (1987) found at least as large effects of semantic priming on fixation durations of objects as on naming time. Thus, there is no compelling evidence that naming time picks up more spurious context effects than does fixation duration.

A second major question, of course, is how one explains these context effects on object identification (assuming, of course, that such context effects have been demonstrated).

⁴ De Graef (1992) explains the slow acquisition of scene information in his study as the result of gradual accumulation of local information.

Perhaps a convenient null hypothesis to entertain is that these effects are completely explained by a mechanism of "object-to-object priming", whereby some objects activate their semantic nodes, which in turn activate the nodes of other, semantically related objects by a process such as "spreading activation" (Collins & Loftus, 1975). Henderson et al. (1987), in fact, demonstrated that both the time spent fixating a target object and the time to name a target object were reduced if a semantically related object was fixated immediately prior to the target object.
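To make the null hypothesis concrete, here is a minimal sketch of object-to-object priming via spreading activation. Everything in it (the object names, relatedness weights, and the linear time function) is an illustrative assumption, not a value from any of the studies cited; the sketch implements only the bare idea that activating one node pre-activates related nodes and thereby shortens their identification times.

```python
# A sketch of the "object-to-object priming" null hypothesis via spreading
# activation (Collins & Loftus, 1975). All names, weights, and the linear
# time function are hypothetical illustrations, not fitted to any data.

RELATEDNESS = {                    # symmetric semantic relatedness, 0..1
    ("sofa", "lamp"): 0.8,         # both living-room objects
    ("sofa", "hydrant"): 0.1,      # semantically unrelated pair
    ("lamp", "hydrant"): 0.1,
}

def related(a, b):
    return RELATEDNESS.get((a, b)) or RELATEDNESS.get((b, a), 0.0)

def fixate(activation, source):
    """Identifying `source` activates its node fully and spreads a fraction
    of that activation to every semantically related node."""
    activation[source] = 1.0
    for node in activation:
        if node != source:
            activation[node] = min(1.0, activation[node] + related(source, node))

def identification_time(activation, node, base_ms=400.0, gain_ms=150.0):
    """Pre-activated nodes reach threshold sooner (hypothetical linear form)."""
    return base_ms - gain_ms * activation[node]

activation = {"sofa": 0.0, "lamp": 0.0, "hydrant": 0.0}
fixate(activation, "sofa")                         # the sofa is fixated first
print(identification_time(activation, "lamp"))     # 280.0 ms: primed
print(identification_time(activation, "hydrant"))  # 385.0 ms: barely primed
```

Note that nothing in a mechanism of this form cares where the primed object sits in the scene; as discussed below, the finding that position violations hurt identification is one reason to doubt that priming of this sort is the whole story.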

There are several pieces of evidence suggesting that the object-to-object priming hypothesis is wrong, or at least incomplete. First, scene context effects have been obtained independently of component objects. Boyce et al. (1989) constructed scene backgrounds which had no local information that defined the background; in spite of this, reliable context effects were obtained in the brief presentation paradigm. Moreover, Boyce et al. found that the objects that were companions to the target object had no effect; object identification was as accurate when the companions were episodically unrelated to the target as when they were episodically related to it (a similar result was obtained by Biederman et al., 1982). These findings indicate that the scene context effects normally observed are probably due to global information rather than local information, especially if the local information is not near fixation. An open question is whether global information will dominate when local object information is processed foveally.

A defender of object-to-object priming might posit, however, that scene backgrounds are merely large objects which are easy to identify and which facilitate object identification through a priming mechanism. Two lines of evidence argue against this. The first is that Biederman et al. (1982) demonstrated (see also De Graef et al., 1990) that objects that are in the wrong place in the right scene are harder to identify than objects in the right place. This would be hard to explain given a priming explanation, which would predict priming from node to node regardless of the geometric configuration of the objects. Instead, it seems to argue that context effects work at least partially through an episodically constructed visual representation rather than a semantic or lexical network. The second piece of evidence is that Boyce et al. (1989; see also Boyce & Pollatsek, 1992b) found quite low correlations between the rated strength of association between scene and object and the size of the actual context effect. This suggests that the mechanism is not one of activation flowing unidirectionally from one node to another. Instead, it suggests that context effects might involve some sort of mutual excitation between various nodes, a parallel process of mutual constraint satisfaction. The above is clearly speculative and needs better definition.

One attempt to get at the mechanism involved when context influences the processing of objects was that of Biederman et al. (1982), who manipulated both the types of violations objects underwent in scenes and the number of them. Two findings are worth noting.
The first is that certain types of anomaly appear to make no difference. For example, when an object was "transparent" (i.e., the background could be seen through it), it was no harder to identify than when it was normal. This suggests that the episodic representation that mediates object identification is not a "full" description of the scene, but instead contains only some of the geometric information in the scene. Another possibility, however, is that certain properties of an object (such as transparency) are hard to encode and are often registered more slowly than a "basic level" categorization of the object, but once encoded do form part of the scene representation.

The second point of interest is that multiple violations of an object (e.g. if an object is both in the wrong scene and in an improbable location, such as floating in the air) result in poorer identification than a single violation. As pointed out by De Graef et al. (1990), this indicates that the mediating representation is unlikely to be a "schema" in the usual meaning of the term (i.e. a prepackaged representation of a scene). That is, once an object does not fit into a schema (e.g. a sofa in a street scene), it should not make any difference whether it is in the middle of the street or in the sky; in neither position does it remotely belong to a prestored schema. Instead, the greater effect of multiple violations over single violations suggests (as argued earlier) that the representation mediating object identification is an on-line construction of an episodic representation of what is being viewed. The fact that semantic and geometric anomalies both appear to have an impact indicates that such information is computed in parallel (as argued by Biederman et al., 1982). What is less clear, however, is what a representation that could coordinate the semantic and geometric information would be like.

THE CONTROL OF EYE MOVEMENTS

There is no simple answer to the question of what the pattern of eye movements is while viewing a visual display, since the eye movement pattern clearly depends on the task (Buswell, 1935; Yarbus, 1967). However, we will propose as a working hypothesis that the basic mechanism of eye movement control is the same across tasks (including reading), even though the resulting pattern may differ.⁵ The model of eye movement control we will advocate was proposed by Morrison (1984) to account for the basic pattern of eye movements in reading. Although there are some difficulties with this model even in accounting for some of the subtleties of the pattern of eye movements in reading (see Pollatsek & Rayner, 1990, for an extensive discussion), we believe that Morrison's model provides a useful framework for discussing eye
movements in static displays in general.⁶ The essence of Morrison's model is as follows. He posits the following sequence of operations during a fixation: a) the fixated word is encoded; b) when encoding is completed (to some satisfactory criterion), attention moves to the next word and a motor program to move to the next word is created; c) after a variable lag, the eye movement program in b) is executed. Thus, the basic process of reading is identifying a word, shifting attention to the next word when you are done, and the eye following covert attention a short time later (roughly 75-100 ms). Additional assumptions that are too complex to go into here can account for word skipping; the basic idea is that a later motor program can cancel an earlier program if the two co-occur within a certain temporal window (see Pollatsek & Rayner, 1990, for a more complete exposition of these issues).

⁵ In making this suggestion, we are fully aware of the fact that there are important differences between reading and scene perception (Loftus, 1983; Rayner, 1984) and that making generalizations from one situation to the other is problematic.
⁶ See Henderson (1992) for similar arguments about how the model can be extended to scene perception.
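To fix ideas, the sequence a)-c) can be rendered as a small simulation. The sketch below is not Morrison's formal model: the encoding-time function, the 0.3 parafoveal-preview factor, and the cancellation window are stand-in assumptions; only the ordering of events and the rough 75-100 ms motor lag come from the description above.

```python
import random

MOTOR_LAG_MS = (75.0, 100.0)   # lag between attention shift and saccade, from the text
CANCEL_WINDOW_MS = 60.0        # hypothetical cancellation window for a later program

def encoding_time(word):
    """Stand-in encoding-time function: longer words take longer to encode."""
    return 50.0 + 25.0 * len(word)

def simulate(words):
    """Yield (word, fixation_duration_ms) for each word actually fixated."""
    i = 0
    while i < len(words):
        encode = encoding_time(words[i])      # a) encode the fixated word to criterion
        lag = random.uniform(*MOTOR_LAG_MS)   # b)-c) attention shifts; saccade fires after a lag
        # Skipping: if word i+1 is identified parafoveally fast enough, a second
        # program (to word i+2) cancels the first within the window, so i+1 is skipped.
        if i + 1 < len(words) and 0.3 * encoding_time(words[i + 1]) < CANCEL_WINDOW_MS:
            step = 2
        else:
            step = 1
        yield words[i], encode + lag
        i += step

for word, duration in simulate("the old man smiled at me".split()):
    print(f"{word:8s}{duration:6.1f} ms")
```

The point of the sketch is that fixation durations fall out of encoding time plus a motor lag, and skipping falls out of program cancellation, with no explicit decision about when or whether to move.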

As we discussed earlier, the pattern of data encountered with free viewing of simple visual displays suggests that the basic processing sequence is similar to that in reading. Henderson et al. (1989) found that when subjects tried to encode a display of four objects, they almost always fixated each object in turn. Moreover, the attentional pattern inferred from contingent display changes in Henderson et al.'s studies indicated that subjects' attention was virtually restricted to the object fixated and the one about to be fixated. Moreover, fixation durations were sensitive to whether successive objects were semantically related, suggesting that the signal to move the eyes was identification of the fixated object. This view is reinforced by the data reviewed earlier demonstrating that context influences fixation time on objects in free scene viewing.

A complete model of eye movement control should be able to account for both when an eye movement occurs and where the eyes go. As discussed above, the "when" appears to be determined by "satisfactory" identification of the fixated word or object. Even for reading, however, Morrison's model is a bit imprecise on the issue of "where". He assumes that the eyes should move to somewhere on the next word, but does not say much about how that intention is converted into spatial coordinates. Within that limitation, however, Morrison's model accounts for reading data quite elegantly and is consistent with the Henderson et al. (1989) data as well. In some sense, of course, it does so because the sequence of eye movements is heavily constrained by the task in both cases. In reading, the subject is presumably attempting to encode words sequentially from left to right on a line, and in the Henderson et al. experiment, the subject is told to encode the stimuli in a certain sequential order. We should emphasize, though, that Morrison's model is not trivial; it shows that the reader's goal of fixating most (but not all) of the words and/or objects that should be processed can be achieved by a "dumb" mechanism that requires no cognitive effort.
The model is incomplete, however, for the free viewing situation, because it contains no assumptions about "where" the next fixation will be when the task does not explicitly prescribe it. As we indicated earlier, much of the early work on eye movements in scene viewing indicated that the pattern is not random and that people tend to fixate areas of the scene that are judged as "informative" either by the experimenter or by other subjects. A basic problem with these experiments (as noted earlier), however, is that "informative" is not very well defined. In particular, in most of the displays used, "informative" areas appear to be both semantically important (e.g. contain objects whose identity is important to decode to understand the scene) and visually striking (e.g. contain lots of brightness changes, definite contours, etc.). Thus, it is quite unclear which level of processing is in fact guiding the eyes in these situations. There is a similar problem, as we noted earlier, in the Loftus and Mackworth (1978) study, which purports to demonstrate that semantic aspects of objects guide the pattern of fixations. They report that subjects fixate a semantically anomalous object earlier than one that belongs in the scene. However, in the examples of scenes that they presented, it seems that the anomalous objects are physically distinctive as well. In general, this problem is difficult to resolve, since "semantic" and visual aspects of scenes and objects are usually highly confounded, unlike the situation with symbolic stimuli like words, where the visual aspect is arbitrarily related to the meaning.

To summarize, the data on scene viewing suggest that the location of fixations is not arbitrary, although they are not particularly diagnostic about whether low-level aspects of the display such as brightness disparities, contours, etc. guide the eye, or higher-level aspects such as the meaning of objects or their relation to the meaning of the scene. We suspect that either level of analysis can guide the eye in a particular task. Consider the following thought experiment. You are given a circular array of objects around a central fixation point and told to maintain fixation until you find a target object (such as a lion) and then move your eyes to that object. We suspect that this task can be done, but that the latency of the eye movement would be quite long (probably on the order of a second or more). Since fixation times on objects in scene viewing are usually fairly short (200-500 ms or so), we find it probable that lower-level information is what usually drives the eye (Findlay, 1981, 1982; He & Kowler, 1989). Certainly, if something moves (as in the wiggle experiments of Boyce and Pollatsek, 1992b), it drives the eye. However, even with static displays, we think it is plausible that lower-level information usually dominates eye control. Since the viewer is concentrating on processing the fixated information, it would make sense to leave the job of deciding where to fixate next to a relatively "dumb" process operating on information in the parafovea and periphery, such as brightness differences or contours, that should use little or no central capacity (Treisman & Gelade, 1980).

There are some data that support our contention that low-level processes dominate the decision of where to move the eyes next. For example, in a visual search task, Engel (1977) showed that the tendency to fixate the object closest to fixation was very coercive. Similarly, Rayner and Fisher (1987) had subjects do a visual search task through strings of letters that were arrayed similarly to text, as blocks of letters separated by spaces. They found that subjects tended to fixate each letter string in turn, even when the target was defined by a cue that was highly visible (the presence of a descender). In addition, He and Kowler (1989) found that it was difficult for subjects to direct their eye movements based on expectations or higher-level analyses of the stimulus if they had to move their eyes reasonably rapidly.

These data all suggest that the following expansion of Morrison's model might be a sketch of how the eyes move in free viewing. The subject inspects an object (or some other significant area of a display). When this is identified, a command is sent out to move the eyes. For the most part, the location of where to move is guided by low-level information, such as the presence of the nearest large discontinuity in the brightness pattern or a nearby significant contour. This is selected as the "next object" to be processed in the scene. Thus, the usual pattern will be to move from one object to the nearest object. Given that there appears to be an "inhibition of return" mechanism (Posner, 1988) whereby the location just processed is inhibited by the attentional system, the subject is saved from oscillating between two nearby objects. It seems plausible that this "dumb" mechanism (fixate the nearest significant region but avoid the one just fixated) could ensure that most of the significant areas of a display would be inspected in a reasonably small number of fixations, and thus be an adequate model of eye control in the ecological sense. We suspect that in tasks such as visual search, where it is important to inspect all the objects efficiently, a fairly different type of control mechanism is involved, where the eyes are programmed to go left-to-right, top-to-bottom, or something of that nature.
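A sketch of this default "where" rule follows. It is again purely illustrative: the salience values, coordinates, and threshold are invented, and inhibition is extended here from the single just-fixated location the text posits to all previously fixated regions, simply so that the toy scanpath terminates.

```python
import math

def next_fixation(current_pos, regions, inhibited, threshold=0.5):
    """Default 'where' rule: the nearest region whose low-level salience
    (contours, brightness discontinuities) exceeds threshold and is not inhibited."""
    candidates = [r for r in regions
                  if r["salience"] >= threshold and r["name"] not in inhibited]
    if not candidates:
        return None    # nothing salient left: a higher-order process must take over
    return min(candidates, key=lambda r: math.dist(current_pos, r["pos"]))

def scan(regions, start=(0.0, 0.0), max_fixations=10):
    """Trace a scanpath by repeatedly applying the default rule."""
    pos, inhibited, path = start, set(), []
    for _ in range(max_fixations):
        target = next_fixation(pos, regions, inhibited)
        if target is None:
            break
        path.append(target["name"])
        inhibited.add(target["name"])   # inhibition of return, extended to all prior
                                        # fixations so the toy scanpath terminates
        pos = target["pos"]
    return path

scene = [
    {"name": "lamp",   "pos": (1.0, 2.0), "salience": 0.9},
    {"name": "sofa",   "pos": (2.0, 2.5), "salience": 0.8},
    {"name": "window", "pos": (5.0, 1.0), "salience": 0.7},
    {"name": "carpet", "pos": (3.0, 0.5), "salience": 0.3},  # below threshold: never fixated
]
print(scan(scene))   # ['lamp', 'sofa', 'window']
```

Note what the sketch deliberately lacks: no semantic analysis of parafoveal objects is needed to choose the target, in line with the claim that the where-decision uses little or no central capacity.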

We view the process sketched above as a model of the default process in scene viewing, much as the Morrison model is a model of the default process in reading. However, it is clear in reading that the usual process of going along more or less word by word is disrupted when the reader encounters difficulty (such as a realization that one has misparsed the sentence or does not know the antecedent of a pronoun). In such cases, large regressive eye movements are often made, sometimes fairly directly to the spot where the information needed to solve the problem is located (Frazier & Rayner, 1982; Rayner, Sereno, Morris, Schmauder & Clifton, 1989; Carpenter & Just, 1977). Thus, even in reading, the Morrison model would need to be modified to allow "higher-order" cognitive processes to take over from the "dumb" default mechanism when there is a problem (see Pollatsek & Rayner, 1990; Rayner & Balota, 1989).
Similarly, in picture viewing, there are undoubtedly some fixations on which a more intelligent "cognitive" mechanism takes over from the default mechanism that operates on low-level information. This could happen either because of task demands (e.g. you are told that the target object is on a red background, so you try to suppress the default mechanism and fixate only red objects) or because some higher-order process interrupts the ongoing default process in mid-task because something seems to be wrong. Two possibilities for what might cause such an interrupt are: a) a realization that either an object has been misperceived or that the global meaning of the scene is unclear, so that a decision is made to check out a particular object or anomalous region of the scene; b) a realization that no new information is being encoded, so that a possibly conscious decision needs to be made to move the eyes a long distance.

There is another way in which the above picture is incomplete. In reading, processing is quite local, involving only the word fixated and the next word or two, whereas in picture processing, global information can be extracted. The work of Boyce and Pollatsek (1992b) indicates that global information about the scene background is extracted continually, at least on the first two fixations on a scene. This suggests that processing of scenes involves not just an initial view of the background followed by encoding of the objects, but instead a process whereby the object information is, in some sense, continually related to the background. This observation has two implications. First, it indicates that the attentional model implicit in Morrison's model has to be modified in some way. One possibility is to posit relatively independent global and local information channels which operate at varying spatial frequencies. Given this assumption, one would describe Morrison's model of reading as positing that readers ignore the global channel, since global information is probably of little use except to remind the reader where the edges of the page of text are. Data we discussed earlier indicate, however, that scene context is not necessarily processed (De Graef et al., 1990; Malcus et al., 1983). This suggests that the global channel is not completely independent of the local channel and that it competes with it for resources (in some sense). The second implication is that the representation of a scene in memory is likely to be a relatively parallel structure involving complex interrelations between the global information and the detail extracted from each fixation. This speculation is supported by data indicating that memory for a picture is influenced very little by whether the scanpath is the same at test as it was during initial inspection of the scene (see Parker, 1978). If preserving the scanpath from inspection to test were important, it would suggest that the structure of the memory representation was something akin to a serial list.
Since it is not, it suggests that the memory structure is organized by some different principle. One possibility is that the structure is like a bulletin board, where the individual objects are attached (like memos) to the basic scaffolding of the scene background.
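The contrast between a serial-list organization and a bulletin-board organization can be made concrete with a toy data structure; the field names and values below are invented for illustration only.

```python
# Two hypothetical memory organizations for a viewed scene. The scanpath
# result (Parker, 1978) favours something like the second: objects are
# attached to the scene scaffolding, not stored in fixation order.

serial_list = ["lamp", "sofa", "window", "rug"]   # order of fixation matters

bulletin_board = {
    "gist": "living room",                        # global scaffolding
    "objects": {                                  # memos pinned to it,
        "lamp":   {"on": "table"},                # in no privileged order
        "sofa":   {"against": "wall"},
        "window": {"in": "wall"},
    },
}
```

Retrieval from the second structure does not depend on the order in which the objects were fixated, which is the property the scanpath data demand.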

SUMMARY

In this article, we have attempted to review the work on eye movements and scene perception. While there are ways of investigating scene perception without studying eye movements (and we have discussed several such paradigms), we believe that it is necessary to study eye movements to achieve a full understanding of scene perception. The situation is similar to that in reading, where paradigms using timed responses to individual words have allowed us to understand quite a bit about word identification. In addition, reading paradigms have been devised to get at global processing that circumvent the necessity of measuring eye movements (such as sequential word-by-word presentation of text). Analogous paradigms seem even more promising in scene perception since, as we have argued, global information can be extracted from an individual fixation; hence these paradigms have significant potential to tell us something about global understanding of scenes. However, if the question of interest in reading is how people process text when they read naturally (i.e. reading silently), analyzing the pattern of eye movements is really the only technique available for answering it. Similarly, if the question of interest is how people process scenes in the real world, understanding the pattern of eye movements will be an important part of the answer.

We must caution that the relationship between the pattern of eye movements and processing is not trivial. Fixation times on objects are not simply object identification times, and we are still quite unclear as to what the signals are that tell the eyes where to move. Our speculative last section, however, indicated that we believe that the basic mechanism of eye control may be relatively simple. In particular, if the eyes are usually moving around to identify objects (as we postulate) and are only occasionally driven by more global considerations of scene comprehension, there is good reason to believe that the pattern of eye movements will be amenable to studying both types of processes (as is the case in reading).

The writing of this article and many of the experiments reported were supported by Grant HD26765 from the National Institutes of Health. We wish to thank our collaborators, Susan Boyce, Peter De Graef, and John Henderson, who made much of this research possible and contributed to the ideas of this paper. However, we assume full responsibility for all errors. Requests for reprints should be sent to Keith Rayner, Department of Psychology, University of Massachusetts, Amherst, MA 01003, USA.


References

Antes, J.R. (1974). The time course of picture viewing. Journal of Experimental Psychology, 103, 62-70.

Antes, J.R., & Penland, J.G. (1981). Picture context effects on eye movement patterns. In D.F. Fisher, R.A. Monty, & J.W. Senders (Eds.), Eye movements: Cognition and visual perception. Hillsdale, NJ: Erlbaum.

Balota, D.A., Pollatsek, A., & Rayner, K. (1985). The interaction of contextual constraints and parafoveal visual information in reading. Cognitive Psychology, 17, 364-390.

Biederman, I. (1972). Perceiving real-world scenes. Science, 177, 77-80.

Biederman, I., Mezzanotte, R.J., & Rabinowitz, J.C. (1982). Scene perception: Detecting and judging objects undergoing violation. Cognitive Psychology, 14, 143-177.

Boyce, S.J., & Pollatsek, A. (1992a). An exploration of the effects of scene context on object identification. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading. New York: Springer-Verlag.

Boyce, S.J., & Pollatsek, A. (1992b). The identification of objects in scenes: The role of scene backgrounds in object naming. Journal of Experimental Psychology: Learning, Memory and Cognition, 18, 531-543.

Boyce, S.J., Pollatsek, A., & Rayner, K. (1989). Role of background information on object identification. Journal of Experimental Psychology: Human Perception and Performance, 15, 556-566.

Breitmeyer, B.G. (1983). Sensory masking, persistence, and enhancement in visual exploration and reading. In K. Rayner (Ed.), Eye movements in reading: Perceptual and language processes. New York: Academic Press.

Breitmeyer, B.G., Kropfl, W., & Julesz, B. (1982). The existence and role of retinotopic and spatiotopic forms of visual persistence. Acta Psychologica, 52, 175-196.

Bridgeman, B., Hendry, D., & Stark, L. (1975). Failure to detect displacement of the visual world during saccadic eye movements. Vision Research, 15, 719-722.

Bridgeman, B., & Mayer, M. (1983). Failure to integrate visual information from successive fixations. Bulletin of the Psychonomic Society, 21, 285-286.

Bridgeman, B., & Stark, L. (1979). Omnidirectional increase in threshold for image shifts during saccadic eye movements. Perception & Psychophysics, 25, 241-248.

Buswell, G.T. (1935). How people look at pictures. Chicago: University of Chicago Press.

Campbell, F.W., & Wurtz, R.H. (1978). Saccadic omission: Why we do not see a grey-out during a saccadic eye movement. Vision Research, 18, 1297-1303.

Carpenter, P.A., & Just, M.A. (1977). Reading comprehension as eyes see it. In M.A. Just & P.A. Carpenter (Eds.), Cognitive processes in comprehension. Hillsdale, NJ: Erlbaum.


Carroll, P.J., Young, J.R., & Guertin, M.S. (1992). Visual analysis of cartoons: A view from the far side. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading. New York: Springer-Verlag.

Collins, A., & Loftus, E. (1975). A spreading activation theory of semantic processing. Psychological Review, 82, 407-428.

Davidson, M.L., Fox, M.J., & Dick, A.O. (1973). The effect of eye movements and backward masking on perceptual location. Perception & Psychophysics, 14, 110-116.

De Graef, P. (1992). Scene-context effects and models of real-world perception. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading. New York: Springer-Verlag.

De Graef, P., Christiaens, D., & d'Ydewalle, G. (1990). Perceptual effects of scene context on object identification. Psychological Research, 52, 317-329.

Engel, F.L. (1977). Visual conspicuity, directed attention and retinal locus. Vision Research, 17, 95-108.

Feldman, J.A. (1985). Four frames suffice: A provisional model of vision and space. Behavioral and Brain Sciences, 8, 265-289.

Findlay, J. (1981). Local and global influences on saccadic eye movements. In D.F. Fisher, R.A. Monty, & J.W. Senders (Eds.), Eye movements: Cognition and visual perception. Hillsdale, NJ: Erlbaum.

Findlay, J. (1982). Global processing for saccadic eye movements. Vision Research, 22, 1033-1045.

Fischer, B. (1992). Saccadic reaction time: Implications for reading, dyslexia, and visual cognition. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading. New York: Springer-Verlag.

Frazier, L., & Rayner, K. (1982). Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14, 178-210.

Friedman, A. (1979). Framing pictures: The role of knowledge in automatized encoding and memory for gist. Journal of Experimental Psychology: General, 108, 316-355.

Friedman, A., & Liebelt, L.S. (1981). On the time course of viewing pictures with a view towards remembering. In D.F. Fisher, R.A. Monty, & J.W. Senders (Eds.), Eye movements: Cognition and visual perception. Hillsdale, NJ: Erlbaum.

Furst, C.J. (1971). Automatizing of visual attention. Perception & Psychophysics, 10, 65-70.

Gross, C.G., & Mishkin, M. (1977). The neural basis of stimulus equivalence across retinal translation. In S. Harnad, R.W. Doty, L. Goldstein, J. Jaynes, & G. Krauthamer (Eds.), Lateralization in the nervous system. New York: Academic Press.

Hayhoe, M., Lachter, J., & Feldman, J. (1991). Integration of form across saccadic eye movements. Perception, 20, 393-402.


Hayhoe, M., Lachter, J., & Moeller, P. (1992). Spatial memory and integration across saccadic eye movements. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading. New York: Springer-Verlag.

He, P., & Kowler, E. (1989). The role of location probability in the programming of saccades: Implications for "center-of-gravity" tendencies. Vision Research, 29, 1165-1181.

Henderson, J.M., & Ferreira, F. (1990). Effects of foveal processing difficulty on the perceptual span in reading: Implications for attention and eye movement control. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 417-429.

Henderson, J.M., Pollatsek, A., & Rayner, K. (1987). The effects of foveal priming and extrafoveal preview on object identification. Journal of Experimental Psychology: Human Perception and Performance, 13, 449-463.

Henderson, J.M., Pollatsek, A., & Rayner, K. (1989). Covert visual attention and extrafoveal information use during object identification. Perception & Psychophysics, 45, 196-208.

Inhoff, A.W., Pollatsek, A., Posner, M.I., & Rayner, K. (1989). Covert attention and eye movements in reading. Quarterly Journal of Experimental Psychology, 41A, 63-89.

Irwin, D.E. (1991). Information integration across saccadic eye movements. Cognitive Psychology, 23, 420-456.

Irwin, D.E. (1992). Visual memory within and across fixations. In K. Rayner (Ed.), Eye movements and visual cognition: Scene perception and reading. New York: Springer-Verlag.

Irwin, D.E., Brown, J.S., & Sun, J.-S. (1988). Visual masking and visual integration across saccadic eye movements. Journal of Experimental Psychology: General, 117, 276-287.

Irwin, D.E., Yantis, S., & Jonides, J. (1983). Evidence against visual integration across saccadic eye movements. Perception & Psychophysics, 34, 49-57.

Irwin, D.E., Zacks, J.L., & Brown, J.S. (1990). Visual memory and the perception of a stable visual environment. Perception & Psychophysics, 47, 35-46.

Jonides, J., Irwin, D.E., & Yantis, S. (1982). Integrating visual information from successive fixations. Science, 215, 192-194.

Jonides, J., Irwin, D.E., & Yantis, S. (1983). Failure to integrate information from successive fixations. Science, 222, 188.

Kroll, J.F., & Potter, M.C. (1984). Recognizing words, pictures and concepts: A comparison of lexical, object and reality decisions. Journal of Verbal Learning and Verbal Behavior, 23, 39-66.

Locher, P.J., & Nodine, C.F. (1974). The role of scanpaths in the recognition of random shapes. Perception & Psychophysics, 15, 308-314.

Loftus, G.R. (1972). Eye fixations and recognition memory for pictures. Cognitive Psychology, 3, 525-551.


Loftus, G.R. (1983). Eye fixations on text and scenes. In K. Rayner (Ed.), Eye movements in reading: Perceptual and language processes. New York: Academic Press.

Loftus, G.R., & Mackworth, N.H. (1978). Cognitive determinants of fixation location during picture viewing. Journal of Experimental Psychology: Human Perception and Performance, 4, 565-572.

Mackworth, N.H., & Morandi, A.J. (1967). The gaze selects informative details within pictures. Perception & Psychophysics, 2, 547-552.

Malcus, L., Klatsky, G., Bennett, R.J., Genarelli, E., & Biederman, I. (1983). Focused attention in scene perception: Evidence against automatic processing of scene backgrounds. Paper presented at the 54th Annual Meeting of the Eastern Psychological Association, Philadelphia.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: W.H. Freeman.

Matin, E. (1974). Saccadic suppression: A review and an analysis. Psychological Bulletin, 81, 899-917.

McConkie, G.W., & Rayner, K. (1975). The span of the effective stimulus during a fixation in reading. Perception & Psychophysics, 17, 578-586.

McConkie, G.W., & Rayner, K. (1976). Identifying the span of the effective stimulus in reading: Literature review and theories of reading. In H. Singer & R.B. Ruddell (Eds.), Theoretical models and processes in reading. Newark, DE: International Reading Association.

McConkie, G.W., & Zola, D. (1979). Is visual information integrated across successive fixations in reading? Perception & Psychophysics, 25, 221-224.

Morrison, R.E. (1984). Manipulation of stimulus onset delay in reading: Evidence for parallel programming of saccades. Journal of Experimental Psychology: Human Perception and Performance, 10, 667-682.

Morrison, R.E., & Rayner, K. (1981). Saccade size in reading depends upon character spaces and not visual angle. Perception & Psychophysics, 30, 395-396.
