Consciousness and Cognition 30 (2014) 256–265


Iconic memory for the gist of natural scenes

Jason Clarke, Arien Mack

The New School for Social Research, Visual Perception Lab, 80 Fifth Avenue, 7th floor, New York City, NY 10011, United States


Article history: Received 11 August 2014

Keywords: Iconic memory; Gist perception; Natural scene perception

Abstract

Does iconic memory contain the gist of multiple scenes? Three experiments were conducted. In the first, four scenes from different basic-level categories were briefly presented in one of two conditions: a cue or a no-cue condition. The cue condition was designed to provide an index of the contents of iconic memory of the display. Subjects were more sensitive to scene gist in the cue condition than in the no-cue condition. In the second, the scenes came from the same basic-level category. We found no difference in sensitivity between the two conditions. In the third, six scenes from different basic-level categories were presented in the visual periphery. Subjects were more sensitive to scene gist in the cue condition. These results suggest that scene gist is contained in iconic memory even in the visual periphery; however, iconic representations are not sufficiently detailed to distinguish between scenes coming from the same category.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

In a fleeting look at the world, we have the subjective impression of a richly detailed, panoramic scene. Glancing along Fifth Avenue, New York City, we feel that we momentarily register all of it: the busy sidewalk, the sky, vehicles, shapes, textures, colors, as well as an overall understanding of what we are looking at – the gist of the scene. While some scientists argue that this is due to a fleeting visual sensory memory of a rich phenomenal experience (Block, 2005, 2011; Koch & Tsuchiya, 2007; Lamme, 2010), others contend that we do not see as much of it as we think (De Gardelle, Sackur, & Kouider, 2009; Naccache & Dehaene, 2007; Noe, 2002; Rosenthal, 2007).

How much information do we see in a glance? When presented with a display of items, subjects can typically report around 4 of the items in the display (Erdmann & Dodge, 1898; Luck & Vogel, 1997; Sligte, Scholte, & Lamme, 2008; Sperling, 1960). This visual short-term memory (VSTM) lasts on the order of seconds (Luck, 2007), and its apparently limited capacity has been used to support the claim that we see far less than we think we do, with the contents of our conscious experience being the contents of VSTM (Block, 2011; Lamme, 2010). However, many psychophysical experiments have demonstrated that observers have more information available about the contents of a briefly presented display than they can typically report. Using an ingenious procedure, the partial-report procedure, in which a cue required subjects to report only a sample of a display containing letters (partial report) instead of all of the items (whole report), Sperling (1960) discovered that subjects are indeed able to report nearly all of the items for about half a second or so (depending on the conditions) after offset of the display. This brief memory was dubbed iconic memory (Neisser, 1967), and many experiments over the last fifty years or so have discovered more about the processes and representations underlying it. The basic findings are that it is high capacity, short duration, and precategorical in nature (Dick, 1974; Sperling, 1960); that is, it does


not contain semantic or category information. However, there is conflicting evidence regarding the latter (e.g. Coltheart, 1980; Keysers, Xiao, Foldiak, & Perrett, 2005).

Do human observers have an iconic memory for the contents and the gist of a natural or real-world scene? As active perceivers, we are faced with the task of processing and understanding the everyday complex, naturalistic visual scenes before us. Natural scenes contain a seemingly infinite number of objects that can be categorized. Human observers are able to effortlessly recognize and categorize these objects despite variations in lighting, occlusion by other objects, and unusual viewpoints (Logothetis & Sheinberg, 1996). Furthermore, objects rarely (if ever) appear in isolation, but are seen as part of a meaningful context (the so-called gist of the scene), and in typical situations their identities are indeed constrained by that context, semantically and physically speaking: a table lamp is more likely to appear in a living room setting than in a forest, for example, and a fire hydrant is more likely to appear on the sidewalk than in the sky. Such regularities, invariants, and knowledge gained through experience might lead to efficiency in processing by the visual system. Indeed, many studies show that the brain may be particularly efficient at the processing of such naturalistic stimuli (Biederman, 1972; Biederman et al., 1973, 1974; Boyce & Pollatsek, 1992; Braun, 2003; Fei-Fei, Van Rullen, & Koch, 2002).

The purpose of this research is to begin to answer the questions: What information is contained in iconic memory from briefly presented natural or real-world scenes? Does iconic memory contain information about the gist of multiple scenes?

Observers are remarkably efficient at extracting the gist or category from a glance at a briefly presented natural scene (Potter, 1976; Fei-Fei, Iyer, Koch, & Perona, 2007; Fei-Fei et al., 2002; Schyns & Oliva, 1994), leading some researchers to claim that information about scene gist is preferentially processed in the brain (Fei-Fei et al., 2007). Indeed, recent neurophysiological research provides evidence that a specific area of the brain, the parahippocampal place area (PPA), is involved in the processing of natural scenes. For example, using fMRI, Epstein and Kanwisher (1998) found that the PPA responded selectively to passively viewed scenes but weakly to objects and not at all to faces. Moreover, they found that the response of the PPA was just as strong for empty rooms (and therefore just a spatial layout) as for rooms containing many objects, and the response disappeared when the spatial arrangement of the room was disrupted such that it no longer defined a coherent space (Epstein & Kanwisher, 1998). This suggests that natural scenes are a special and evolutionarily important kind of visual stimulus.

Psychophysical and neuroimaging studies have found that humans are remarkably efficient at categorizing scenes with "minimal attention" (Fei-Fei et al., 2002). Indeed, neuroimaging studies have shown that neural activity in cortical areas known to be involved in natural scene perception is present even without selective attention (Peelen, Fei-Fei, & Kastner, 2009). However, other studies have demonstrated the need for attention in consciously perceiving natural scenes (Cohen, Alvarez, & Nakayama, 2011; Mack & Clarke, 2012) as well as for priming from natural scenes (Clarke, Ro, & Mack, 2013, abstract).
These latter studies show that while gist is picked up easily, it still requires attention to become conscious and thus reportable.

Why should we be interested in whether the gist of natural scenes exists in iconic memory? For a few reasons at least. First, natural scenes are complex stimuli, and their presence in iconic memory supports a view of this memory as containing not only simple features but also high-level perceptual structure. Second, the gist of a scene (a beach, a bathroom, etc.) is information about the semantics or category of the display; therefore, its presence here would suggest that iconic memory is not precategorical but contains information about the meaning of the stimulus. Finally, demonstrating the presence of gist information in iconic memory suggests that neural areas that encode information about natural scenes, namely the PPA, are active in iconic memory. As these areas are positioned higher up in the visual system, this would give further support to the view that iconic memory is a late process in the visual hierarchy (Keysers et al., 2005).

The procedure we used to assess the contents of iconic memory for scenes was modeled on Sperling's partial- versus whole-report procedure, which served as his measure of the contents of iconic memory (e.g. Averbach & Coriell, 1961; Sperling, 1960). In all experiments, subjects were briefly presented with an array of scenes (for 250 ms in the first two experiments and 500 ms in the third experiment). In the no-cue condition, 200 ms after offset of the array, a one-word gist descriptor, e.g. "waterfall", appeared on the screen, and the subject's task was to report whether a scene fitting that description had been present. In the cue condition, immediately after offset of the array, a cue directing attention to one of the no-longer-present scenes appeared for 200 ms and was followed by a one-word gist descriptor. The 200 ms ISI in the no-cue condition is, therefore, replaced by the 200 ms cue in the cue condition. The subjects' task in the cue condition was to report whether the cued scene fit that description. While in many respects these measures are similar to those used in earlier research, in the experiments reported here subjects were not required to name the items present in the display (as they were in earlier studies), but instead to report whether a word following the display referred to any of the scenes in the no-cue condition or to the cued scene in the cue condition. Following Sperling, our index of the contents of iconic memory was the difference between the number of items correctly reported in the cue and no-cue conditions (Sperling, 1960). We reasoned that while in the cue condition the subject only has to inspect the cued scene (akin to the partial-report condition for Sperling), in the no-cue condition the subject has to search more of the array: all of it when there is no matching scene and, on average, half of it when there is. While this is being done the memory is decaying. Thus, the no-cue condition may underestimate the information available in the iconic store: by the time the subject has searched through an internal representation of the array, it has disappeared.
This search is not needed in the cue condition (akin to the partial report in Sperling's experiments), which, therefore, provides a more accurate index of the contents of iconic memory, as it allows us to sample the information that is available immediately at display offset without requiring a search of rapidly decaying representations.
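To make the two conditions concrete, the sketch below lays out the trial logic just described: a brief multi-scene display, then either a 200 ms cue or a 200 ms blank interval, then a one-word gist probe answered yes/no. The experiments themselves were programmed in SuperLab; this Python reconstruction is purely illustrative, and the names (`Trial`, `make_trial`, the toy gist pool) are ours, not the authors'.

```python
# Hypothetical sketch (not the authors' SuperLab code) of the trial logic described above:
# a 250 ms multi-scene display, then either a 200 ms cue (cue condition) or a 200 ms blank
# ISI (no-cue condition), then a gist word that matches the cued scene / any scene on half
# of the trials.
import random
from dataclasses import dataclass

@dataclass
class Trial:
    scenes: list[str]        # gist labels of the displayed scenes
    cued_index: int | None   # index of the cued scene, or None in the no-cue condition
    probe_word: str          # one-word gist descriptor shown after the display
    match: bool              # does the probe describe the cued scene (cue) / any scene (no-cue)?

def make_trial(scene_pool: list[str], cue_condition: bool, n_scenes: int = 4) -> Trial:
    scenes = random.sample(scene_pool, n_scenes)    # scenes from different basic-level categories
    cued = random.randrange(n_scenes) if cue_condition else None
    match = random.random() < 0.5                   # probe matches on half of the trials
    if match:
        probe = scenes[cued] if cue_condition else random.choice(scenes)
    else:
        probe = random.choice([s for s in scene_pool if s not in scenes])
    return Trial(scenes, cued, probe, match)

# Example: one cue-condition trial drawn from a toy pool of gist labels.
pool = ["forest", "kitchen", "beach", "temple", "waterfall", "city street", "farm", "bathroom"]
print(make_trial(pool, cue_condition=True))
```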


2. Experiment one

In the first experiment, we asked whether an iconic memory exists for the gist of multiple (four) natural scenes taken from different basic-level categories (e.g. a beach, a temple, a mountain, a city street) presented simultaneously.

3. Methods and materials

3.1. Equipment

The experiment was programmed and run on SuperLab 4.5. Stimuli were presented on a 1.83 GHz Intel Core Duo Mac Mini and a DELL M782 monitor set at 1152 × 864 resolution, with a refresh rate of 75 Hz.

3.2. Stimuli

Stimuli were 160 photographs of natural scenes from diverse categories, e.g. a bathroom, a garden, a castle, a cityscape, a forest, a beach. The photographs were found using Google Images. A group of 25 naïve subjects were shown each picture and asked to give one word describing the gist of the picture, in order to make sure the scenes had an agreed-upon gist. Scenes were chosen as target scenes (40 photographs) if the same one-word gist descriptor was given by at least 70% of the naïve subjects. All scenes subtended 6 degrees horizontally and 5 degrees vertically at a viewing distance of 56 cm (the corresponding on-screen sizes are sketched below). A display consisted of 4 scenes centered around a fixation '+' cross. The center of each scene was 4 degrees from fixation, while the nearest corner of the scene was 2 degrees from fixation (see Fig. 1).

3.3. General design and procedure

Subjects were presented with 40 randomized trials. On each trial, following the fixation cross (1500 ms), a display of 4 scenes was presented for 250 ms. In Experiment 1, each of the four scenes in any display was chosen with the criterion that it should come from a different basic-level category.
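As a note on the stimulus geometry, the visual angles quoted above can be converted to on-screen sizes at the stated 56 cm viewing distance with the usual 2·d·tan(θ/2) relation. The short sketch below is ours and only illustrates that arithmetic; the numbers come from the text.

```python
# Minimal sketch: converting the visual angles quoted above into on-screen sizes
# at the 56 cm viewing distance. The function name is ours; the values come from the text.
import math

def degrees_to_cm(angle_deg: float, viewing_distance_cm: float = 56.0) -> float:
    """On-screen extent (cm) subtending `angle_deg` at the given viewing distance."""
    return 2.0 * viewing_distance_cm * math.tan(math.radians(angle_deg) / 2.0)

print(f"6 deg wide   -> {degrees_to_cm(6):.1f} cm")   # ~5.9 cm scene width
print(f"5 deg high   -> {degrees_to_cm(5):.1f} cm")   # ~4.9 cm scene height
print(f"4 deg offset -> {degrees_to_cm(4):.1f} cm")   # ~3.9 cm centre-to-centre offset (small-angle approx.)
```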

Fig. 1. Experimental set-up for four-scene display.


In the cue condition, following the 4-scene display, a red line (the cue) appeared either above one of the two pictures in the upper half of the display or below one of the two pictures in the lower half of the display for 200 ms. In the no-cue condition, following the 4-scene display, a blank screen appeared for 200 ms. Subsequently, in both conditions, a word, e.g. "forest", appeared at fixation, and the subjects' task was to report whether the word described the gist of the cued scene (in the cue condition) or of any of the scenes (in the no-cue condition). A dialogue box was provided in which subjects had to type their response. Each subject saw half of the trials with a word that described the gist of either the cued scene (in the cue condition) or one of the scenes in the display (in the no-cue condition) and half of the trials with a word that did not describe the gist of the cued scene or any of the scenes in the display. There were ten practice trials containing scenes not used in the experiment. Subjects were randomly assigned to one of the two conditions.

3.4. Scoring and data analysis

In a between-subjects design, performance with the cue was compared to performance without the cue. The index of iconic memory was the difference between the cue and no-cue conditions. We analyzed sensitivity using d′. Subjects' responses were scored as 'hits' if they correctly identified the word as describing the gist of the scene, and as 'false alarms' if they incorrectly identified the word as describing the gist of the scene. These hits and false alarms were then analyzed to ascertain a subject's sensitivity. We predicted that if the cue is effective and subjects have an iconic memory for the gist of multiple scenes, they should show greater sensitivity when the cue is present, as evidenced by a higher d′ in the cue condition (the cue-condition group) than in the no-cue condition (the no-cue-condition group).

3.5. Subjects

Eighteen subjects from The New School University community were tested. All subjects reported normal or corrected-to-normal vision. Half of the subjects were assigned to each of the conditions, cue or no-cue.

3.6. No-cue condition

The subjects were instructed to focus on the fixation cross, which appeared for 1500 ms. Immediately after, a display consisting of 4 scenes equidistant from the center of the screen appeared for 250 ms. After a 200 ms ISI (chosen because 200 ms is within the duration of iconic memory), a word, e.g. "kitchen", appeared at the center of the screen. The subjects' task was to report whether the word referred to the gist of one of the scenes in the display, and they wrote their answer using a text box. Following this, they pressed a key to continue to the next trial.

3.7. Cue condition

The procedure was the same as above, with one difference: following the 4-scene presentation, a cue (a red line) served to indicate which of the pictures to report on (thus replacing the ISI in the no-cue condition). The cue was presented randomly and was equally likely to appear in any of the four locations: just above the location of the upper left or upper right picture, or just below the lower left or lower right picture; every cue was equidistant from fixation and the center of the screen.

4. Results

Adding a cue significantly improved gist identification, as evidenced by higher d′ scores in the cue condition (mean d′ = 1.65, SD = 0.44) compared to the no-cue condition (mean d′ = 1.1, SD = 0.25), unpaired t(16) = 3.07, p < .05 (see Fig. 2). Cohen's effect size (d = .69) suggested a moderate to large effect.
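The scoring just described can be summarized in a few lines: sensitivity is the standard equal-variance d′ = z(hit rate) − z(false-alarm rate), and the Cowan's K capacity estimate reported in the next section is (hit rate − false-alarm rate) multiplied by the number of items. The sketch below is ours; in particular, the clipping of extreme rates is an assumption, since the paper does not say how hit or false-alarm rates of 0 or 1 were handled, and the example rates are toy values, not the observed data.

```python
# Sketch of the scoring described above: d' = z(hit rate) - z(false-alarm rate),
# and Cowan's K = (hit rate - false-alarm rate) * N as reported in the Results.
# Clipping of extreme rates (0 or 1) is our assumption; the paper does not say
# how such cases were handled.
from scipy.stats import norm

def d_prime(hit_rate: float, fa_rate: float, eps: float = 0.01) -> float:
    hit_rate = min(max(hit_rate, eps), 1 - eps)   # avoid infinite z-scores
    fa_rate = min(max(fa_rate, eps), 1 - eps)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

def cowan_k(hit_rate: float, fa_rate: float, n_items: int) -> float:
    return (hit_rate - fa_rate) * n_items

# Toy example (not the observed data): 70% hits, 20% false alarms, 4 scenes in the display.
print(round(d_prime(0.70, 0.20), 2))   # ~1.37
print(cowan_k(0.70, 0.20, 4))          # 2.0 items
```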

Fig. 2. Four scenes and iconic memory – results of mean d′ scores for the cue and no-cue conditions.


Furthermore, Cowan's K, a measure of memory capacity which takes the false alarms into account by subtracting them from the hits and multiplying the result by the number of items to be remembered (K = (Hits − FA) × N), showed that in the no-cue condition subjects have on average 1.5 of the 4 items in memory, while in the cue condition they have on average 2.2 of the 4 items in memory.

5. Discussion

The results show that without a cue subjects can typically report the gist of about 1.5 of the 4 scenes, whereas when a cue directs attention to a specific scene at offset of the four-scene array, subjects can successfully report the gist of more of the scenes. This suggests that iconic memory contains gist information about more scenes than indicated by the no-cue condition. Moreover, as scene gist is a categorical form of information, this is support for iconic memory containing category information. This will be addressed in more detail in the General Discussion.

While the cue condition reveals that the gist of more of the scenes is available for report than is revealed by the no-cue condition, it tells us little about the fidelity and detail contained in those representations. The scenes in this experiment were chosen so that there was minimal semantic overlap among scenes in any particular display, e.g. a forest, a kitchen, a baseball game, and farm animals. Thus, the cue advantage seen here might be due to the fact that each scene comes from a different basic-level category, which maximizes the differences between them and makes them easier to distinguish from each other. The cue advantage seen here might therefore come from rather sparse or low-level representations of scene gist. For example, a forest and a kitchen are quite dissimilar in the Spatial Envelope or spatial-layout sense, and subjects might be able to take advantage of this dissimilarity in judging whether a scene was present in the display or not. In other words, when presented with, say, the word "forest", subjects might have a representation containing a statistical summary of the spatial layout (e.g. vertical lines and low openness) in their memory, and this might lead them to correctly identify it as a forest without requiring any detailed representation. Similar arguments were given against the claim that being able to detect the presence of an animal in a peripherally presented scene is evidence for gist perception (Evans & Treisman, 2005): perhaps subjects were only detecting a feature of an animal, e.g. a feather, and this led to high levels of identification in those experiments. However, the question is whether this "sketchy representation" would be adequate to distinguish between exemplars from the same general class, e.g. bodies of water (streams, rivers, ocean, lake, or pond) or different animals in scenes.

The following experiments were designed to look at the nature and the fidelity of representations in iconic memory, that is, whether they contain enough detail to support scene gist identification when the scenes come from the same category. In order to examine more closely the detail of the representations of natural scenes in iconic memory, the next experiment used scenes in the display which came from the same basic-level category, e.g. rooms. If subjects are using sparse representations to ascertain scene gist, they should not be able to do so in this new experiment, as scenes were chosen to have a similar gist, e.g. bedroom, bathroom, living room, and kitchen.
If subjects are not relying on coarse representations but instead have more detailed representations in memory, then they should be able to distinguish between these different types of scenes.

6. Experiment 2

Experiment 2 sought to examine whether iconic memory representations are detailed enough (Sligte, Vandenbroucke, & Lamme, 2010) to support gist identification when each of the scenes in the display comes from the same basic-level category.

7. Methods and materials

7.1. Stimuli

Twenty basic-level categories were chosen for the experiment (e.g. rooms, bodies of water, thoroughfares, transport, wild animals, farm animals, celebrations). Four scenes were chosen for each category. For example, the scenes in the "wild animal" category were: bears playing in a wood, lions on a savannah, tigers running on a plain, and rhinos drinking at a riverbank. Again, each scene was shown to 25 naïve observers, who gave a one-word gist descriptor of the scenes. Only the scenes that received the same gist descriptor from 70% of the subjects were used as target scenes. Furthermore, the 25 observers were asked to name another exemplar of the basic-level category that was not contained in the scenes. For example, in the case of "wild animals", giraffes would belong to the group and be another exemplar of this basic-level category. These words were then used in the trials where the gist word did not refer to any of the 4 scenes in the display (but belonged to the same category and, therefore, could have appeared in the display).

7.2. Subjects

Eighteen new subjects were recruited from The New School University community. All subjects reported normal or corrected-to-normal vision. Subjects were randomly assigned to either the cue condition or the no-cue condition.

Fig. 3. Four scenes from the same basic-level category – mean d′ scores for the cue and no-cue conditions.

7.3. Methods and materials

This experiment was very much like the previous experiment except for the stimuli contained in each display. In the previous experiment they were from different basic-level categories, while in this experiment they were from the same basic-level category. However, importantly, the targets (the cued scenes) on each trial were the same as the targets in the previous experiment, only in the present experiment they were presented with other scenes from the same basic-level category, e.g. rooms, bodies of water, wild animals, and public interiors.

8. Results

The general finding here is that when the scenes come from the same basic-level category, the cue does not enable access to information from the iconic representation to identify the gist of the natural scenes. This is evidenced by a non-significant difference in d′ scores between the cue condition (mean d′ = 0.73, SD = 0.44) and the no-cue condition (mean d′ = 0.92, SD = 0.44), unpaired t(16) = 0.9, p > .05 (see Fig. 3). Furthermore, the Cowan's K measure of memory capacity showed that in the cue condition subjects had the gist of a mean of 1.04 (SD = 0.55) scenes available in iconic memory, and in the no-cue condition a mean of 1.15 (SD = 0.40) scenes. Comparing this to the results of the previous experiment, in which subjects had the gist of a mean of 2.2 (SD = 0.44) scenes in the cue condition and 1.12 (SD = 0.25) in the no-cue condition, we see there is a difference in the ability to report scene gist from iconic memory when the scenes come from different basic-level categories compared to when they come from the same category. When the scenes come from different categories, subjects can report the gist of a scene from an iconic representation; when the scenes come from the same basic-level category, they cannot.

9. Discussion

This experiment looked for evidence of an iconic memory for the gist of scenes that come from the same basic-level category, which would give us some indication of the amount of detail stored in iconic memory. As items from the same basic-level category have much in common (e.g. chairs can be armchairs, stools, office chairs, or babies' high chairs; outdoor urban scenes can be outdoor markets, bus stations, train stations, car parks, etc.), the ability of a cue to enable access to an iconic representation with enough detail to discriminate it from a similar scene (and thus successfully report the correct answer in the task) would suggest that iconic memory contains even more structure or more detailed representations than had previously been demonstrated. However, the experiment indicates that when the scenes come from the same basic-level category, the cue is not effective. This suggests that any iconic memory of the four scenes does not contain the resolution (in other words, the detail) to enable subjects to differentiate between scenes from the same basic-level category. This will be discussed more in the General Discussion section.

10. Experiment 3

As noted earlier, there is evidence that scene gist is picked up in the visual periphery with minimal attention (Fei-Fei et al., 2002) but not under conditions of inattention (Mack & Clarke, 2012). For example, in the Mack and Clarke study, while subjects' attention was engaged in a task at fixation (reporting the longer arm of a briefly presented cross), a natural scene was presented in the visual periphery.
Under these conditions, 85% of subjects failed to see the scene; that is, they were inattentionally blind to it. Even with divided attention (when subjects were performing the cross task but were expecting another stimulus), only 50% of subjects reported gist. This is in stark contrast to the Fei-Fei et al. study, in which subjects were able to report the presence or absence of an animal in a briefly presented peripheral scene.

The following experiment sought to examine whether subjects could pick up the gist of multiple scenes presented fully in the visual periphery from an iconic representation. Previous work has shown that perception of an object is best when


viewers fixate within 1–2 degrees of the stimulus, with performance dropping off with increasing distance from fixation (Henderson & Hollingworth, 1999; Larson & Loschky, 2009). Therefore, even if an iconic representation of peripherally presented scenes exists, subjects might not be able to pick it up efficiently due to the lower visual resolution of peripheral vision. However, if subjects are able to report the gist of an iconic representation in a peripheral location, this suggests that iconic representations even in the periphery are structured and detailed enough for identification. It should be noted that in the previous two experiments the nearer corner of each picture was located within central vision, while the center was 4 degrees from fixation. In this experiment, each picture was centered 7.5 degrees from fixation, in the visual periphery.

11. Methods and materials

11.1. Equipment

All experiments were programmed and run on SuperLab 4.5. Scenes were presented on a 1.83 GHz Intel Core Duo Mac Mini and a DELL M782 monitor set at 1152 × 864 resolution, with a refresh rate of 75 Hz.

11.2. Subjects

Twelve subjects from The New School University participated in the experiments. Ages ranged from 21 to 23 years. All participants reported normal or corrected-to-normal vision.

11.3. Stimuli

Stimuli were 360 photographs of scenes taken from Google Image archives. Again, a group of 25 naïve subjects were shown each picture and asked to give one word describing the gist of the picture, in order to make sure the scenes had an agreed-upon gist. Scenes were chosen as target scenes (60 photographs) if the same one-word gist descriptor was given by at least 70% of the naïve subjects. All scenes subtended 6 degrees horizontally and 5 degrees vertically at a viewing distance of 56 cm. A display consisted of 6 scenes arranged around a nominal clock face, with the center of each scene 7.5 degrees from fixation, thus placing it in the visual periphery.

11.4. General design and procedure

Subjects were presented with 60 randomized trials. On each trial, following the fixation cross (1500 ms), a display of 6 scenes was presented for 500 ms. The extra time allotted here (500 ms instead of 250 ms in the previous two experiments)

Fig. 4. Experimental set-up for six scene experiment.


was given to allow more processing time for the display, since Sperling had found that there was little difference in report for presentations of 15–500 ms (Sperling, 1960). In the cue condition, immediately following the offset of the display, a red line (the cue) extended from fixation to the edge of one of the scenes (6 degrees) for 200 ms. In the no-cue condition, no cue was presented (instead an ISI of 200 ms followed the offset of the six-scene display). Subsequently, in both conditions, a word appeared in the center of the screen (as in the previous experiments described above) and the subject's task was to report whether the word described the gist of the cued scene in the cue condition or whether the word described any of the scenes in the no-cue condition. Importantly, as with the previous two experiments, no display contained more than one exemplar of the gist, i.e. there were never two beaches or two gardens in the display. Each subject saw half of the trials with a word that described the gist of either the cued scene (in the cue condition) or one of the scenes in the display (in the no-cue condition) and half of the trials with a word that did not describe the gist of the cued scene or any of the scenes in the display. There were ten practice trials containing scenes not used in the main experiment. Subjects were randomly assigned to one of the two conditions: the cue condition or the no-cue condition (see Fig. 4).

12. Results

Scoring and data analysis followed the same criteria as in the previous two experiments. Again, adding a cue significantly improved gist identification, as evidenced by higher d′ scores in the cue condition (mean d′ = 1.7) compared to the no-cue condition (mean d′ = 0.69), unpaired t(10) = 2.39, p < .05 (see Fig. 5). A Cowan's K analysis of the number of items contained in memory showed that in the no-cue condition subjects have on average 1.48 of the 6 items in memory, while in the cue condition they have on average 2.81 of the 6 items in memory.

13. Discussion

The results from this experiment demonstrate that when a cue draws attention to the location of a no-longer-visible scene immediately after its offset, subjects have access to twice as many scenes as in the no-cue condition. This is so even with peripherally presented scenes, which are known to have lower resolution due to their location (Larson & Loschky, 2009).

14. General discussion


These three experiments sought to examine in more detail the nature of representations in iconic memory, looking at whether iconic memory can support multiple scene gist identification. In all three experiments, subjects were shown an array of scenes (4 in the first two experiments and 6 in the final experiment) under one of two conditions: (1) with a cue presented immediately at offset of the scenes followed by a word which did or did not refer to the gist of the cued scene, or (2) the same conditions without the cue. Following Sperling's rationale, we reasoned that the number of scenes whose gist was reported in the cue condition (partial report) compared to the no-cue condition (whole report) provides an index of the amount of gist information available in iconic memory. Our results demonstrated that the gist of multiple scenes is available in iconic memory and that this information permits distinguishing the gist of scenes from each other when the scenes are from different basic-level scene categories (as evidenced by a higher sensitivity to the gist of the scene in the cue condition than in the no-cue condition) but is not adequate for distinguishing the gist of scenes from the same basic-level category. Furthermore, subjects showed a significant cue advantage (versus the no-cue condition) for the identification of the gist of cued scenes in the 6-scene array when each scene was fully located in the periphery of the visual field and each scene came from a different category. This suggests that iconic representations even in the visual periphery contain enough detail and perceptual structure to support gist identification.

Fig. 5. Six scenes in periphery and iconic memory – results of mean d′ scores for the cue and no-cue conditions.


What information are subjects using to do this? Is gist picked up from identifying a single object? Does the ability to identify the gist of a cued iconic representation necessarily mean the whole scene was encoded semantically? Or could it be that subjects artfully "pick up" the gist by strategizing on the basis of coarse, low-level representations, e.g. a representation of the color of a scene (green) when faced with the word "forest"? In this model, subjects do not have to know that the "green" was a forest, but can successfully guess using this low-level information. Similarly, if the cued scene was a supermarket interior and the word was "beach", subjects could use low-level information, e.g. a representation of the supermarket containing no more information than that it was an interior, to rule out this description as the gist of that scene (i.e. it was an interior, so it cannot have been a beach).

Future experiments need to look more closely at the nature of these iconic representations and at exactly what information subjects are using in the first experiment, where the scenes come from different categories, that is not available in the second experiment, where the scenes come from the same category. However, the results of the experiments presented here suggest that subjects are using low-spatial-frequency and global information, rather than specific object information, to pick up scene gist from an iconic representation. This is consistent with the work of Oliva and colleagues, who have discovered that scene gist can be understood from global features that give a statistical summary of the spatial layout of the scene (e.g. Schyns & Oliva, 1994). Indeed, when the scenes are centered 7.5 degrees away from fixation, subjects must be using low-spatial-frequency information due to the low resolution and poorer acuity of the visual periphery. Global and low-spatial-frequency information can be used successfully, and indeed preferentially, in hybrid stimuli consisting of superimposed low-spatial-frequency and high-spatial-frequency scenes to identify outdoor and indoor categories, mean depth, and other Spatial Envelope categories when images are flashed very briefly (Oliva & Torralba, 2006; Schyns & Oliva, 1994). If subjects are using such information, then if we were to manipulate the spatial frequency content of the scenes, we should continue to see the cue advantage even when scenes are defined perceptually only by low-spatial-frequency information. However, if subjects are relying on the identification of an object in the scene to detect the gist, then this information should be removed when only low-spatial-frequency information remains.

Oliva (2005), describing the results of psychophysical experiments in her lab, writes: "These empirical results suggest that a reliable perceptual gist may be structured quickly based on coarse spatial scale information (from 4 to 8 cycles per image). At this resolution, enough structural cues are provided to allow the identification of the scene, although the local identity of objects may not be recovered." (Oliva, 2005; our italics). If the local identity of objects is not recoverable from spatial scales of 4–8 cycles per image, and a cue does allow subjects to access scene gist from an iconic representation shown at this spatial frequency, then this would suggest that whatever information the visual system is using to identify the gist does not come solely from identification of a local object.
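To illustrate the manipulation proposed above (presenting scenes that are defined only by low spatial frequencies, e.g. up to about 8 cycles per image), a low-pass filter of this kind could be implemented as sketched below. The radial FFT mask and the function name are our own illustrative choices, not a procedure taken from this paper or from Oliva and colleagues' stimuli.

```python
# Illustrative sketch: keep only spatial frequencies up to ~8 cycles per image, so that
# scene identification would have to rely on coarse layout rather than local object detail.
# The radial FFT mask and the function name are ours, not a procedure from the paper.
import numpy as np

def low_pass(image: np.ndarray, cycles_per_image: float = 8.0) -> np.ndarray:
    """Return a low-pass filtered copy of a greyscale image (2-D array)."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h)) * h   # vertical frequency, cycles per image
    fx = np.fft.fftshift(np.fft.fftfreq(w)) * w   # horizontal frequency, cycles per image
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    f[radius > cycles_per_image] = 0              # zero out everything above the cutoff
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

# Example: filter a random 256 x 256 "image" down to 8 cycles per image.
blurred = low_pass(np.random.rand(256, 256), cycles_per_image=8.0)
print(blurred.shape)
```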
Is the scene gist information that is accessed by the cue in these experiments pre-categorical or post-categorical? In other words, does iconic memory for the scenes contain only perceptual information, which, if attended, becomes available semantically? Or does iconic memory for the scenes also contain semantic or category information? Iconic memory has traditionally been supposed to be pre-categorical, due in part to semantic cues being ineffective (e.g. a high tone requiring subjects to report the letters in a display and a low tone requiring subjects to report the numbers; Sperling, 1960). However, this view has been challenged on the basis of experimental results. For example, when subjects are shown a briefly presented display and asked to report only the letters (or only the digits), they perform better than if asked to give a whole report (i.e. all letters and all digits), showing that they are able to use category information as a cue for retrieval from iconic memory (Duncan, 1983). In other experiments, Eriksen and Eriksen (1974) and Eriksen and Hoffman (1973) presented subjects with a row of three letters with the task of classifying the middle letter; they found that the latency of the response was influenced by the flanking non-target letters. These results led Coltheart (1980) to argue that the information in iconic memory has been given full analysis and that "iconic memory is not an early sensory stage in the information processing sequence but a late stage that follows not precedes stimulus identification" (Coltheart, 1980).

Furthermore, and more recently, psychophysical and neuroimaging experiments locate iconic memory for faces and objects in the superior temporal sulcus (STS), a late area in the ventral visual pathway known to contain high-level information about the category of a stimulus (Keysers et al., 2005). This finding led Keysers and colleagues to argue that the distinction between precategorical and postcategorical should be replaced by a continuum of degrees of shape categorization, with progressively more information being added as information is processed from early to high-level visual areas, leading to categorization in higher visual cortex (the STS). They see iconic memory as being 'multi-layered' and containing both kinds of information: pre-categorical, purely visual information is replaced by category information as the iconic representation is processed in successively higher layers. The information in lower levels (e.g. precise spatial information) disappears quickly, leaving only category information, which also fades after several hundred milliseconds. According to Keysers et al., this model of iconic memory explains the mislocation errors that are typically made in partial-report experiments (e.g. Sperling, 1960), where subjects mistakenly report letters from the Sperling matrix that were close to the location of the cued row, and do so increasingly as the delay of the cue increases. Spatial information and category information are both available in iconic memory; however, they have different decay rates, with spatial information decaying more quickly than category information, leading to mislocation errors.

The experiments presented here support the multi-layer model of Keysers et al. (2005). Information about scenes is known to be preferentially processed in late visual areas (namely the parahippocampal place area, or PPA), a high-level area along the ventral pathway.
With this in mind, an argument can be made that scene gist in iconic memory is processed at progressively higher levels of visual cortex and finally categorized in the PPA (much as Keysers et al. argue that the stimuli


in their iconic memory experiments are categorized in the STS). At first, according to this model, perceptual gist would be created progressively from early visual areas onwards, leading to category information being embodied in higher visual areas, namely the PPA. Precise spatial information would decay rapidly, leaving category information, which itself would decay within a few hundred milliseconds. However, both kinds of information would be available from iconic memory.

In conclusion, the experiments reported here show that iconic memory contains more high-level representations (i.e. scene gist) than had previously been shown to exist. It contains enough perceptual structure for subjects to be able to pick up the meaning or gist of a scene when the scenes come from different basic-level categories. However, when the scenes come from the same basic-level category, subjects are not able to identify the gist of the cued scene. This suggests that the iconic representation is a coarse representation based on information contained in low spatial frequencies, with enough detail to distinguish between dissimilar gist information (e.g. gardens versus rooms) but not between similar gist information (e.g. bedrooms and bathrooms). A further experiment demonstrated that scene gist can be picked up from iconic memory when the scenes are presented fully in the periphery of the visual field and each scene is from a different basic-level category. This work suggests that iconic memory does not contain only simple features (e.g. orientation, color, or motion information); it also has information about the meaning of multiple scenes (i.e. their gist) gained from a more holistic or global perceptual structure in which these local features have been combined. These results indicate that more perceptual and semantic information is contained in this fleeting memory than has previously been demonstrated.

References

Averbach, E., & Coriell, A. S. (1961). Short-term memory in vision. Bell System Technical Journal, 40, 309–328.
Biederman, I. (1972). Perceiving real-world scenes. Science, 177(4043).
Biederman, I., Glass, A. L., & Webb Stacy, E. (1973). Searching for objects in real world scenes. Journal of Experimental Psychology, 97(1), 22–27.
Biederman, I., Rabinowitz, J. C., Glass, A. L., & Webb Stacy, E. (1974). On the information extracted from a glance at a scene. Journal of Experimental Psychology, 103(3), 597–600.
Block, N. (2005). Two neural correlates of consciousness. Trends in Cognitive Sciences, 46–52.
Block, N. (2011). Perceptual consciousness overflows cognitive access. Trends in Cognitive Sciences, 15(12).
Boyce, S. J., & Pollatsek, A. (1992). Identification of objects in scenes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 541–543.
Braun, J. (2003). Natural scenes upset the visual applecart. Trends in Cognitive Sciences, 7(1), 7–9.
Clarke, J., Ro, T., & Mack, A. (2013). The persistence of inattentional blindness and the absence of priming by natural scenes. Journal of Vision, 13(9), 1136 [abstract].
Cohen, M. A., Alvarez, G., & Nakayama, K. (2011). Natural-scene perception requires attention. Psychological Science, 22(9).
Coltheart, M. (1980). The persistences of vision. Philosophical Transactions of the Royal Society of London: Biological Sciences, 302, 283–294.
De Gardelle, V., Sackur, J., & Kouider, S. (2009). Perceptual illusions in brief visual presentations. Consciousness and Cognition, 18(3), 569–577.
Dick, A. O. (1974). Iconic memory and its relation to perceptual processing and other memory mechanisms. Perception and Psychophysics, 16, 575–596.
Duncan, J. (1983). Perceptual selection based on alphanumeric class: Evidence from partial reports. Perception and Psychophysics, 33, 533–547.
Epstein, R., & Kanwisher, N. (1998). A cortical representation of the local visual environment. Nature, 392.
Erdmann, B., & Dodge, R. (1898). Psychologische Untersuchungen über das Lesen auf experimenteller Grundlage. Halle: Niemeyer. Reprinted in R. L. Green (Ed.), Human memory: Paradigms and paradoxes. Psychology Press (1992).
Eriksen, C. W., & Hoffman, J. E. (1973). The extent of processing of noise elements during selective encoding from visual displays. Perception and Psychophysics, 25, 249–263.
Eriksen, B. A., & Eriksen, C. W. (1974). Effects of noise letters upon the identification of a target letter in a nonsearch task. Perception and Psychophysics, 16, 143–149.
Evans, K., & Treisman, A. (2005). Perception of objects in natural scenes: Is it really attention free? Journal of Experimental Psychology: Human Perception and Performance, 31(6), 1476–1492.
Fei-Fei, L., Van Rullen, R., & Koch, C. (2002). Rapid natural scene categorization in the near absence of attention. Proceedings of the National Academy of Sciences of the USA, 99(14).
Fei-Fei, L., Iyer, A., Koch, C., & Perona, P. (2007). What do we perceive in a glance of a real-world scene? Journal of Vision, 7(1):10, 1–29.
Henderson, J. M., & Hollingworth, A. (1999). High-level scene perception. Annual Review of Psychology, 50.
Keysers, C., Xiao, D. K., Foldiak, P., & Perrett, D. I. (2005). Out of sight but not out of mind: The neurophysiology of iconic memory in the superior temporal sulcus. Cognitive Neuropsychology, 22, 316–332.
Koch, C., & Tsuchiya, N. (2007). Attention and consciousness: Two distinct brain processes. Trends in Cognitive Sciences, 11, 16–22.
Lamme, V. A. F. (2010). How neuroscience will change our view on consciousness. Psychology Press.
Larson, A. M., & Loschky, L. C. (2009). The contributions of central versus peripheral vision to scene gist recognition. Journal of Vision, 9(10), 1–16.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.
Luck, S. J., & Vogel, E. K. (1997). The capacity of visual working memory for features and conjunctions. Nature, 390.
Luck, S. J. (2007). Visual short-term memory. Scholarpedia, 2, 3328.
Mack, A., & Clarke, J. (2012). Gist perception requires attention. Visual Cognition, 20(3).
Naccache, L., & Dehaene, S. (2007). Reportability and illusions of phenomenality in the light of the global neuronal workspace model. Behavioral and Brain Sciences, 30, 481–548.
Neisser, U. (1967). Cognitive psychology. New York: Appleton-Century-Crofts.
Noe, A. (2002). Is the visual world a grand illusion? Journal of Consciousness Studies, 9(5–6).
Oliva, A. (2005). Gist of a scene. In L. Itti, G. Rees, & J. K. Tsotsos (Eds.), The encyclopedia of the neurobiology of attention. San Diego, CA: Elsevier.
Oliva, A., & Torralba, A. (2006). Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155.
Peelen, M. V., Fei-Fei, L., & Kastner, S. (2009). Neural mechanisms of rapid natural scene categorization in human visual cortex. Nature, 460, 94–97.
Potter, M. C. (1976). Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning and Memory, 2, 509–522.
Rosenthal, D. (2007). Phenomenological overflow and cognitive access. Behavioral and Brain Sciences, 30, 481–548.
Schyns, P., & Oliva, A. (1994). From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition. Psychological Science, 5, 195–200.
Sligte, I. G., Scholte, H. S., & Lamme, V. A. F. (2008). Are there multiple visual short-term memory stores? PLoS ONE, 3(2).
Sligte, I. G., Vandenbroucke, A., & Lamme, V. A. F. (2010). Detailed sensory memory, sloppy working memory. Frontiers in Psychology, 1, 175.
Sperling, G. (1960). The information available in brief visual presentations. Psychological Monographs, 74.
