
Rapid, generalized adaptation to asynchronous audiovisual speech
Erik Van der Burg and Patrick T. Goodbourn
School of Psychology, Brennan MacCallum Building (A18), University of Sydney, Sydney, NSW 2006, Australia

Research
Cite this article: Van der Burg E, Goodbourn PT. 2015 Rapid, generalized adaptation to asynchronous audiovisual speech. Proc. R. Soc. B 282: 20143083. http://dx.doi.org/10.1098/rspb.2014.3083

Received: 19 December 2014
Accepted: 20 January 2015

Subject Areas: behaviour, neuroscience
Keywords: temporal recalibration, audiovisual speech, synchrony judgement, cross-modal adaptation, multisensory integration

Author for correspondence: Erik Van der Burg
e-mail: [email protected]

Electronic supplementary material is available at http://dx.doi.org/10.1098/rspb.2014.3083 or via http://rspb.royalsocietypublishing.org.

PTG, 0000-0001-7899-7355

The brain is adaptive. The speed of propagation through air, and of low-level sensory processing, differs markedly between auditory and visual stimuli; yet the brain can adapt to compensate for the resulting cross-modal delays. Studies investigating temporal recalibration to audiovisual speech have used prolonged adaptation procedures, suggesting that adaptation is sluggish. Here, we show that adaptation to asynchronous audiovisual speech occurs rapidly. Participants viewed a brief clip of an actor pronouncing a single syllable. The voice was either advanced or delayed relative to the corresponding lip movements, and participants were asked to make a synchrony judgement. Although we did not use an explicit adaptation procedure, we demonstrate rapid recalibration based on a single audiovisual event. We find that the point of subjective simultaneity on each trial is highly contingent upon the modality order of the preceding trial. We find compelling evidence that rapid recalibration generalizes across different stimuli, and different actors. Finally, we demonstrate that rapid recalibration occurs even when auditory and visual events clearly belong to different actors. These results suggest that rapid temporal recalibration to audiovisual speech is primarily mediated by basic temporal factors, rather than higher-order factors such as perceived simultaneity and source identity.

1. Introduction

We combine information from different sensory modalities to create a rich and coherent representation of the external world [1–5]. Multisensory integration conveys numerous benefits: for instance, in a noisy environment, we can better understand a speaker when we can also observe the speaker's lip movements [6]. Typically, the perceptual benefits of audiovisual integration are optimal when information from the two different senses is presented in close temporal proximity, and decline with increasing temporal asynchrony [2,7,8].

In natural scenes, audiovisual events deriving from the same source are synchronized at their origin. However, owing to physical and neurological constraints, there are likely to be significant cross-modal differences in latency from the perspective of the observer. Sound travels more slowly than light; and conversely, neural transduction is typically faster in the auditory system than in the visual system [9]. Other factors, such as the allocation of spatial attention [10–13], and the intensity of stimuli [14–17], can also affect the speed of neural processing for auditory and visual information. Yet our brain seems to compensate for the resulting latency differences by adapting to asynchronous multisensory information [18,19]. Vroomen et al. [19], for example, exposed participants to 240 audiovisual events—an auditory tone in combination with a visual flash—with a fixed temporal offset, for a total duration of 3 min. Following this adaptation procedure, participants saw the tone and flash presented across a range of stimulus onset asynchronies (SOAs), and were asked after each trial whether the audiovisual information was presented simultaneously or successively. The authors estimated the point of subjective simultaneity (PSS) by fitting a psychometric function, and found strong evidence of temporal recalibration: the PSS was shifted towards the modality that was leading in the adaptation phase. Note that each experimental trial was preceded by a re-exposure phase (eight audiovisual events with the same temporal offset as in the adaptation phase) to top up adaptation before each trial.


Such lengthy top-up procedures have since been shown to be critical to maintaining audiovisual temporal recalibration in this paradigm [20].

Typically, studies of temporal recalibration have used prolonged periods of asynchronous audiovisual adaptation—often lasting for several minutes—suggesting that adaptation may be rather sluggish [18,19,21–26] (but see [27]). Recently, however, Van der Burg et al. [28] have shown that adaptation to asynchronous audiovisual events occurs far more rapidly than was previously thought. Participants were asked to judge whether or not a flash was synchronized with a tone (that is, they performed a classical simultaneity judgement task) across a range of audiovisual SOAs. They showed that the PSS on the current trial (t) was highly contingent upon the modality order of the stimulus on the preceding trial (t − 1). More specifically, the PSS on trial t was shifted towards the leading modality on trial t − 1. These inter-trial effects suggest that adaptation to asynchronous multisensory information is fast, and can occur in the wake of a single audiovisual event, in the absence of an explicit adaptation procedure. Van der Burg et al. [28] proposed that a fast-acting recalibration mechanism serves to instantaneously realign signals at onset, in order to maximize the perceptual benefits of audiovisual integration. For instance, rapid adaptation to the first asynchronous audiovisual event in a stream of speech would optimize comprehension for the remainder of the stream.

However, the perception of audiovisual speech is special: it exhibits enhanced multimodal integration relative to artificial stimuli [29], and relies on specialized neural structures [30]. It is thus important to determine whether rapid recalibration does in fact occur in the case of auditory speech in combination with the corresponding visual lip movements. Furthermore, the perceived timing of audiovisual events is known to depend critically on the nature of the stimuli. Dixon & Spitz [31], for example, measured the minimum temporal offset required to detect asynchrony between an auditory and a visual event, and found that participants needed a greater offset in order to detect asynchrony in a clip of audiovisual speech than in a clip of a hammer hitting a peg. They thus proposed that tolerance to audiovisual asynchrony is greater when stimuli are very familiar, as is the case for speech. It is feasible, then, that participants might rapidly adapt to audiovisual asynchrony for unfamiliar, inanimate stimuli (such as a visual flash and an auditory tone; see [28]), but not for familiar speech stimuli.

Temporal recalibration has been found to generalize across different artificial stimuli, whether induced by prolonged [18,25] or brief [32] exposures. By contrast, Roseboom & Arnold [21] found that for audiovisual speech stimuli, participants concurrently maintained two different recalibrations—one with a positive shift in PSS, the other with a negative shift—when each was associated with a different actor. Their result cannot be attributed solely to location-specific recalibration [23,33], because it was still evident when actors switched locations after the adaptation phase. Recalibration for audiovisual speech may thus be uniquely stimulus-specific. However, the extent to which rapid recalibration generalizes across different audiovisual speech stimuli has not been explored previously.

In this study, participants were shown a brief movie clip of an actor saying a single syllable. The voice either preceded or followed the corresponding lip movements across a range of SOAs (0, 50, 100, 200, 400 or 800 ms), and participants were asked to judge whether or not the lip movements were synchronized with the voice. We show that participants rapidly recalibrate to asynchronous audiovisual speech stimuli: the PSS on a given trial was highly contingent upon the modality order on the preceding trial. Moreover, we present compelling evidence that rapid recalibration generalizes across different audiovisual speech stimuli (Experiment 1), and across different actors (Experiments 2 and 3). Finally, we demonstrate that recalibration occurs following auditory and visual events that are not perceived to be simultaneous, and even when auditory and visual stimuli belong to different actors—such as when a male actor's lip movements are paired with a female actor's voice (Experiment 3). These results suggest that rapid temporal recalibration is primarily driven by temporal factors, with relatively little influence of higher-order factors such as perceived simultaneity and source identity.

2. Experiment 1

In Experiment 1, we examined whether participants rapidly adapt to asynchronous audiovisual speech stimuli. Participants were shown a brief clip of a male actor (figure 1a) saying a single syllable. The voice was temporally offset from the corresponding lip movement by a range of SOAs, and participants were asked to judge whether or not the lip movement was synchronized with the voice. If participants rapidly adapt to asynchronous audiovisual speech stimuli, then we expect the PSS to be contingent upon the modality order on the preceding trial. We also randomly interleaved two different syllables to examine whether rapid recalibration generalizes across different audiovisual speech stimuli.

(a) Material and methods

(i) Participants

Ten volunteers participated in the experiment (four females; mean age 27 years, ranging from 18 to 38 years). Eight participants were naive as to the purpose of the experiment; the other two participants were the authors.

(ii) Stimuli and apparatus

The experiment was programmed and run using E-Prime software (Psychology Software Tools, Sharpsburg, PA, USA). Stimuli were presented on an 18-inch Trinitron E400 monitor (Sony Corporation, Tokyo, Japan) running at a spatial resolution of 1024 × 768 pixels and a refresh rate of 85 Hz. Participants were seated at a comfortable distance from the monitor (approximately 60 cm) and wore HD 380 Pro circumaural headphones (Sennheiser, Hanover, Germany), which delivered auditory stimuli with a peak intensity of approximately 70 dB. Movie clips were recorded of an Australian male actor (David; figure 1a) saying one of two syllables: / / or / /. Clips were cropped to 2 s in duration, and QT Sync software (Michael Gerbes, Potsdam, Germany) was used to manipulate the SOA between the visual and auditory components. A noise-gating filter was applied to the audio such that the sound level ramped on and off over about 50 ms immediately before and after the syllable, while the remainder of the clip was silent. In total, 11 clips were created for each syllable, corresponding to the 11 SOAs used in the experiment.
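The on/off envelope can be illustrated with a short sketch. This is not the noise-gating filter actually applied in the study (that was done in dedicated software, and its exact shape is not specified); it is a minimal numpy reconstruction of the idea of a roughly 50 ms fade, with all names hypothetical.

```python
import numpy as np

def apply_onset_offset_ramp(audio, sample_rate, ramp_s=0.05):
    """Fade a waveform in and out over ~50 ms, leaving the middle intact.

    audio : 1-D array of samples; sample_rate : in Hz.
    Illustrative only; the study used a noise-gating filter, not
    necessarily a raised-cosine ramp.
    """
    n = int(sample_rate * ramp_s)
    ramp = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, n)))  # 0 -> 1
    out = np.asarray(audio, dtype=float).copy()
    out[:n] *= ramp         # ramp on before the syllable
    out[-n:] *= ramp[::-1]  # ramp off after the syllable
    return out
```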


Figure 1. First frame of the movie clips used in the study. (a) David, actor in Experiments 1 and 2. (b) Sam, actor in Experiments 2 and 3. (c) Rosalind, actor in Experiment 3. (Online version in colour.)




(iii) Design and procedure

Each trial started with the presentation of a white fixation dot (0.3° radius; 60 cd m⁻²) on a black background (less than 0.5 cd m⁻²) for a duration of 1 s. The fixation dot was replaced by a movie clip (16.8° × 12.0°) of 2 s duration, depicting the actor saying the syllable / / or / /. The auditory signal was the original voice, which was either advanced or delayed relative to the lip movement across a range of SOAs (0, 50, 100, 200, 400 or 800 ms). After presentation of the clip, a white question mark was displayed. Participants reported whether or not the lip movement and voice were synchronized by pressing the '1' key or the '0' key, respectively. The fixation dot for the next trial appeared after this unspeeded response. Prior to the experiment, participants performed practice trials to gain familiarity with the task. Each combination of modality order (audition leads or vision leads), SOA (0, 50, 100, 200, 400 or 800 ms) and syllable (/ / or / /) was presented four times in each block, in a random order. Participants performed two sessions, each containing four experimental blocks of 96 trials, leading to 768 trials in total. As our analyses are contingent on trial t − 1, the first trial of each block was excluded.
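As a concrete illustration of this design (a hypothetical sketch, not the authors' E-Prime implementation), the following builds one randomized block of 96 trials:

```python
import itertools
import random

modality_orders = ["audition_leads", "vision_leads"]
soas_ms = [0, 50, 100, 200, 400, 800]
syllables = ["syllable_1", "syllable_2"]  # placeholders for the two spoken syllables

# Every combination appears four times per block: 2 x 6 x 2 x 4 = 96 trials.
# (As in the paper's counting, the 0 ms SOA is listed under both orders.)
block = list(itertools.product(modality_orders, soas_ms, syllables)) * 4
random.shuffle(block)

# Signed-SOA convention used for analysis: negative = audition leads.
signed_soas = [-soa if order == "audition_leads" else soa
               for order, soa, _ in block]
```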

Figure 2. Results of Experiment 1. (a) Proportion of 'simultaneous' responses as a function of SOA and modality order on the previous trial (t − 1), for trials in which the syllable was different from that on trial t − 1. Data are collapsed across participants. Curves show the best-fitting Gaussian in a least-squares sense. (b) Proportion of 'simultaneous' responses as a function of SOA and modality order on the previous trial (t − 1), for trials in which the syllable was the same as that on trial t − 1. (c) Mean PSS as a function of t versus t − 1 syllable congruency, and t − 1 modality order. In all panels, error bars represent ±1 s.e.m. (Online version in colour.)

(b) Results and discussion

We examined whether the probability of a 'simultaneous' response on a given trial (t) was reliably affected by the modality order of the previous trial (t − 1). We fit a Gaussian distribution to the proportion of 'simultaneous' responses as a function of SOA, separately for each participant, t − 1 modality order (audition leads or vision leads) and syllable congruency (same or different syllable, t versus t − 1). There were three free parameters to the fit: the mean, which we take to reflect the PSS; the standard deviation, which we take to reflect the bandwidth of the window of simultaneity; and the amplitude, which reflects the total proportion of 'simultaneous' responses. Critically, any effect of temporal recalibration should be evident in a comparison of the PSS parameter between experimental conditions. Accordingly, we focus here on analyses of PSS only. For completeness, we present analyses of the bandwidth and amplitude parameters in the electronic supplementary material; however, none of those analyses reveals effects that would change the interpretation of results presented here. Also in the electronic supplementary material, we report confirmatory non-parametric analyses for the PSS parameter.

The results of Experiment 1 are depicted in figure 2. Figure 2a presents data, collapsed across participants, for trials on which the syllables differed between t and t − 1. The proportion of 'simultaneous' responses is shown as a function of SOA and t − 1 modality order, along with Gaussian fits to the data. Figure 2b presents the corresponding data for trials on which the syllables were the same for t and t − 1. The mean PSS, bandwidth and amplitude across all conditions were 41 ms, 208 ms and 0.95, respectively. PSS was entered as the dependent variable in an analysis of variance (ANOVA) with t − 1 modality order and t versus t − 1 syllable congruency as within-subject factors. The ANOVA yielded a reliable main effect of t − 1 modality order, F(1, 9) = 9.25, p = 0.014, ηₚ² = 0.51, reflecting the fact that the PSS was significantly shorter when audition led vision on the previous trial (31 ms) than when vision led audition on the previous trial (52 ms). The main effect of syllable congruency was not significant, F(1, 9) = 0.17, p = 0.691, ηₚ² = 0.02; nor was the two-way interaction, F(1, 9) = 0.05, p = 0.835, ηₚ² = 0.01.

Although we did not use a prolonged adaptation procedure, we found compelling evidence that adaptation to asynchronous audiovisual speech stimuli occurs rapidly: PSS was highly contingent upon the modality order of the preceding trial. Furthermore, we found that rapid recalibration generalizes across different audiovisual speech stimuli: there was no main effect of syllable congruency, nor was there an interaction between syllable congruency and t − 1 modality order. These results indicate that a fast-acting recalibration mechanism serves to instantaneously realign signals at onset. This rapid realignment might maximize the perceptual benefits of audiovisual integration. For instance, generalization across different speech stimuli would be necessary in order for rapid adaptation to an initial asynchronous audiovisual event to produce optimized comprehension for the remainder of the stream.
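The three-parameter fit described above can be sketched as follows. The response proportions below are invented for illustration only (they are not the study's data), and scipy's least-squares `curve_fit` stands in for whatever fitting routine the authors used:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(soa, pss, bandwidth, amplitude):
    """Mean = PSS; s.d. = window of simultaneity; amplitude = peak
    proportion of 'simultaneous' responses."""
    return amplitude * np.exp(-((soa - pss) ** 2) / (2.0 * bandwidth ** 2))

# Signed SOAs (ms; negative = audition leads) and illustrative,
# made-up response proportions for one participant and condition.
soas = np.array([-800, -400, -200, -100, -50, 0, 50, 100, 200, 400, 800])
p_simul = np.array([0.02, 0.15, 0.55, 0.80, 0.90, 0.95,
                    0.93, 0.88, 0.70, 0.25, 0.03])

# Least-squares fit; p0 gives rough starting values for the optimizer.
(pss, bw, amp), _ = curve_fit(gaussian, soas, p_simul, p0=[40, 200, 0.95])
print(f"PSS = {pss:.0f} ms, bandwidth = {bw:.0f} ms, amplitude = {amp:.2f}")
```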



3. Experiment 2

In Experiment 1, we showed for the first time, to our knowledge, that temporal recalibration generalizes across different speech stimuli. However, the actor was the same on all trials; thus not only was the higher-order factor of actor identity held constant, but lower-level properties of the visual and auditory stimuli also remained highly similar (i.e. lip movements on the same face were paired with syllables spoken in the same voice). In Experiment 2, we examined whether rapid recalibration generalizes across different actors. Experiment 2 was identical in design to Experiment 1, with the exception that we used two different male actors (figure 1a,b) saying the same syllable. If rapid recalibration is not specific to a particular stimulus, then we expect transfer across actors: for any given trial, we expect the PSS to be contingent upon the modality order on the preceding trial, regardless of the actor.

(a) Material and methods

(i) Participants

Eight volunteers participated in the experiment (three females; mean age 26 years, ranging from 20 to 38 years). Six participants were naive as to the purpose of the experiment; the other two participants were the authors.

(ii) Stimuli and apparatus

Materials were the same as in Experiment 1, except that we used two different Australian male actors (David, figure 1a; and Sam, figure 1b) saying the syllable / /.

(iii) Design and procedure

The procedure was the same as for Experiment 1. Modality order (audition leads or vision leads), SOA (0, 50, 100, 200, 400 or 800 ms) and actor (David or Sam) were selected randomly on each trial. Participants performed two sessions, each containing four experimental blocks of 96 trials, leading to 768 trials in total. As for Experiment 1, the first trial of each block was excluded from analysis.

Figure 3. Results of Experiment 2. (a) Proportion of 'simultaneous' responses as a function of SOA and modality order on the previous trial (t − 1), for trials in which the actor was different from that on trial t − 1. Data are collapsed across participants. Curves show the best-fitting Gaussian in a least-squares sense. (b) Proportion of 'simultaneous' responses as a function of SOA and modality order on the previous trial (t − 1), for trials in which the actor was the same as that on trial t − 1. (c) Mean PSS as a function of t versus t − 1 actor congruency, and t − 1 modality order. In all panels, error bars represent ±1 s.e.m. (Online version in colour.)

(b) Results and discussion

We fit a Gaussian distribution to the proportion of 'simultaneous' responses as a function of SOA, separately for each participant, t − 1 modality order and actor congruency (same or different actor, t versus t − 1). The results of Experiment 2 are depicted in figure 3. Figure 3a presents data, collapsed across participants, for trials on which the actors differed between t and t − 1. The proportion of 'simultaneous' responses is shown as a function of SOA and t − 1 modality order, along with Gaussian fits to the data. Figure 3b presents the corresponding data for trials on which the actor was the same for t and t − 1. The mean PSS, bandwidth and amplitude across all conditions were 37 ms, 205 ms and 0.97, respectively. PSS was entered as the dependent variable in an ANOVA with t − 1 modality order and t versus t − 1 actor congruency as within-subject factors. The ANOVA yielded a reliable main effect of t − 1 modality order, F(1, 7) = 5.93, p = 0.045, ηₚ² = 0.50. This indicates that the PSS was shorter when audition led vision on the previous trial (24 ms) than when vision led audition on the previous trial (50 ms). The two-way interaction between t − 1 modality order and t versus t − 1 actor congruency was not significant, F(1, 7) = 1.04, p = 0.342, ηₚ² = 0.13; nor was the main effect of t versus t − 1 actor congruency, F(1, 7) = 3.15, p = 0.119, ηₚ² = 0.31. Experiment 2 thus replicated our earlier finding that rapid recalibration to asynchronous audiovisual speech stimuli occurs online, based on a single asynchronous audiovisual event. Furthermore, the results demonstrate that adaptation generalizes across different actors, suggesting that rapid recalibration is not specific to particular speech stimuli.


4. Experiment 3

Multisensory interactions depend on both basic temporal factors, such as temporal proximity between the auditory and visual information [7,8,34,35], and higher-order factors, such as whether or not a participant believes that the auditory and visual stimuli belong together [36–38]. The latter factor—the participant's belief that a single audiovisual event, rather than two separate unisensory events, has occurred—has been dubbed the unity assumption [39]. In Experiments 1 and 2, the unity assumption was never violated, as the auditory information was always paired with the matching visual information. That is, a specific voice always coincided with the lip movements of the actor to whom the voice belonged. In Experiment 3, we manipulated the congruency between the auditory and visual events within single trials, in order to examine whether the unity assumption must be satisfied to produce rapid temporal recalibration (see also [36,40] for a similar methodology). Participants viewed either a male or female actor (figure 1b,c) in combination with either a male or female voice saying a single syllable. The actor producing the voice, and the actor producing the lip movements, were manipulated independently. Stimuli were presented such that half of the trials contained congruent audiovisual pairings (the male voice paired with the male actor's lip movements, or the female voice paired with the female actor's lip movements), and the remaining trials contained incongruent audiovisual pairings (the male voice paired with the female actor's lip movements, or the female voice paired with the male actor's lip movements). If rapid recalibration is largely due to temporal factors rather than higher-order factors—like the unity assumption—then we expect to observe rapid recalibration regardless of the congruency between auditory and visual stimuli.

(a) Material and methods

(i) Participants

Eight volunteers participated in the experiment (five females; mean age 29 years, ranging from 23 to 38 years). Six participants were naive as to the purpose of the experiment; the other two participants were the authors.


(ii) Stimuli and apparatus

Materials were the same as for Experiments 1 and 2, except that we used an Australian male actor (Sam; figure 1b) and female actor (Rosalind; figure 1c) saying the syllable / /. Furthermore, the voices and lip movements were independently manipulated such that on half of the trials, participants were presented with congruent pairings (the male voice paired with the male actor's lip movements, or the female voice paired with the female actor's lip movements), and on the remaining trials participants were presented with incongruent pairings (the male voice paired with the female actor's lip movements, or the female voice paired with the male actor's lip movements).

(iii) Design and procedure

The procedure was the same as for Experiments 1 and 2. Modality order (audition leads or vision leads), SOA (0, 50, 100, 200, 400 or 800 ms), voice actor (Sam or Rosalind) and lip-movement actor (Sam or Rosalind) were selected randomly on each trial. Participants performed four sessions, each containing four experimental blocks of 96 trials, leading to 1536 trials in total. As for Experiments 1 and 2, the first trial of each block was excluded from analysis.
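The independent manipulation of voice and face can be sketched in the same hypothetical style as before (not the authors' code); congruency falls out of the factorial crossing rather than being a separate factor:

```python
import itertools
import random

modality_orders = ["audition_leads", "vision_leads"]
soas_ms = [0, 50, 100, 200, 400, 800]
actors = ["Sam", "Rosalind"]

# Voice actor and lip-movement actor vary independently, so exactly half
# of the possible pairings are congruent (voice == face) by construction.
combos = [{"order": order, "soa": soa, "voice": voice, "face": face,
           "congruent": voice == face}
          for order, soa, voice, face in itertools.product(
              modality_orders, soas_ms, actors, actors)]

# Factors were selected randomly on each trial; one block of 96 trials:
block = [random.choice(combos) for _ in range(96)]
```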

Figure 4. Results of Experiment 3. (a) Mean PSS as a function of audiovisual congruency on trial t − 1, and t − 1 modality order, for 'congruent' trials in which audition and vision originated from the same actor. (b) Mean PSS as a function of audiovisual congruency on trial t − 1, and t − 1 modality order, for 'incongruent' trials in which audition and vision originated from different actors. (c) Mean PSS as a function of the SOA on trial t − 1. The dashed line shows the mean PSS under baseline conditions, when the SOA on trial t − 1 was zero. An asterisk indicates that the PSS was significantly (p < 0.05; one-tailed) different from the PSS under baseline conditions. In all panels, error bars represent ±1 s.e.m. (Online version in colour.)

(b) Results and discussion

We fit a Gaussian distribution to the proportion of 'simultaneous' responses as a function of SOA, separately for each participant, t − 1 modality order, audiovisual congruency on trial t − 1 and audiovisual congruency on trial t. The results of Experiment 3 are shown in figure 4. Figure 4a shows PSS as a function of t − 1 audiovisual congruency and t − 1 modality order for congruent trials; figure 4b shows the same results for incongruent trials. The mean PSS, bandwidth and amplitude were 49 ms, 205 ms and 0.99, respectively.

PSS was entered as the dependent variable in an ANOVA with t − 1 modality order, t − 1 audiovisual congruency and t audiovisual congruency as within-subject factors. The ANOVA yielded a significant effect of t − 1 modality order, F(1, 7) = 18.80, p = 0.003, ηₚ² = 0.89, indicating that the PSS was shorter when audition led vision on the preceding trial (41 ms) than when vision led audition (57 ms) on the preceding trial. The main effect of audiovisual congruency on trial t was not significant, F(1, 7) = 2.23, p = 0.179, ηₚ² = 0.24; nor was the main effect of audiovisual congruency on trial t − 1, F(1, 7) = 0.02, p = 0.888, ηₚ² = 0.00. None of the two-way interactions was significant, all F < 1.37, p > 0.280, ηₚ² < 0.16, suggesting that temporal recalibration was independent of the congruency on the preceding (t − 1) and current (t) trial. Similarly, the three-way interaction was not significant, F(1, 7) = 2.17, p = 0.184, ηₚ² = 0.24.

In follow-up analyses, we examined whether PSS varied as a function of the SOA on trial t − 1 (see also [28]). We expected the PSS on trial t to be smaller (i.e. vision must lead by a smaller amount to achieve perceived simultaneity) when audition led on trial t − 1, compared with the PSS on trial t when the SOA on trial t − 1 was zero (baseline conditions). By contrast, we expected the PSS on trial t to be larger (i.e. vision must lead by a greater amount to achieve perceived simultaneity) when vision led on trial t − 1, compared with the PSS under baseline conditions. Figure 4c illustrates how the mean PSS varies as a function of the SOA on trial t − 1. An ANOVA with PSS as the dependent variable, and SOA on trial t − 1 as the within-subject factor, yielded a reliable effect, F(10, 70) = 4.8, p = 0.001, confirming our expectation that PSS varies as a function of SOA on trial t − 1. In further exploratory analyses, we compared the PSS for each t − 1 SOA with the PSS under baseline conditions in a series of one-tailed t-tests. PSS was significantly shorter when audition led vision on trial t − 1 by 100 ms, t(7) = 2.2, p = 0.034, but not for shorter or longer SOAs (all p > 0.29). By contrast, PSS was significantly larger when vision led on trial t − 1 by 200 ms, t(7) = 3.0, p = 0.01; 400 ms, t(7) = 2.1, p = 0.036; and 800 ms, t(7) = 4.1, p = 0.002; but not for shorter SOAs (all p > 0.09).

In Experiment 3, we again found evidence for rapid recalibration to asynchronous audiovisual speech. We found that recalibration occurred irrespective of audiovisual congruency on both trial t and trial t − 1. This suggests that the recalibration is largely due to temporal factors, and not susceptible to higher-order cognitive factors such as the unity assumption. These results are consistent with those of Green et al. [40], who found that the McGurk effect was equally strong for incongruent audiovisual pairings (e.g. when a male voice was paired with female lip movements) as for congruent pairings. Experiment 3 also indicates that perceived simultaneity on trial t − 1 is not necessary to produce recalibration: strong recalibration effects were evident even when vision led audition by 800 ms, under which conditions participants hardly ever perceived the auditory and visual events as simultaneous (1.9% of trials).
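The follow-up analysis can be outlined as below. This is a hypothetical reconstruction rather than the authors' analysis code: `fit_gaussian_pss` stands for the per-subset Gaussian fit sketched in Experiment 1, and the one-tailed tests halve the two-tailed p-value from a paired t-test.

```python
import numpy as np
from scipy import stats

def conditional_pss(prev_soa, soa, resp, fit_gaussian_pss):
    """PSS on trial t, conditioned on the signed SOA of trial t - 1.

    prev_soa, soa : signed SOAs in ms (negative = audition leads) on
    trials t - 1 and t; resp : 0/1 'simultaneous' responses on trial t.
    fit_gaussian_pss : routine returning the fitted PSS for a subset.
    """
    return {s: fit_gaussian_pss(soa[prev_soa == s], resp[prev_soa == s])
            for s in np.unique(prev_soa)}

def one_tailed_vs_baseline(pss_condition, pss_baseline):
    """Directional paired t-test of per-participant PSS estimates
    against the 0 ms (baseline) condition."""
    t, p_two_tailed = stats.ttest_rel(pss_condition, pss_baseline)
    return t, p_two_tailed / 2.0
```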

5. General discussion

The present study provides the first demonstration, to our knowledge, that recalibration to asynchronous audiovisual speech stimuli occurs rapidly, on the basis of a single audiovisual event. Although we did not use an explicit adaptation or top-up procedure, the effect sizes reported here are comparable to—or even greater than—the effect sizes reported in studies using prolonged adaptation procedures [21,41–43].


Few previous studies have investigated whether temporal recalibration to asynchronous audiovisual events, either induced rapidly or using prolonged adaptation, generalizes across different stimuli. Here, we demonstrate for the first time, to our knowledge, that rapid recalibration generalizes across different speech sounds and actors: we found compelling evidence that the PSS on a given trial was contingent on the modality order on the preceding trial, even when the sounds or actors differed between the two successive trials. These findings are consistent with those of Navarra et al. [25], who showed that prolonged adaptation to asynchronous audiovisual stimuli generalized across different frequencies of an auditory tone. Likewise, Heron et al. [23] found that recalibration generalized across stimulus features (auditory and spatial frequencies), though it did not generalize across spatial locations. The results are also consistent with a recent study by Harvey et al. [32], who found that rapid recalibration generalizes across different colours and auditory frequencies.

Is the inter-trial effect due to a purely auditory or visual recalibration, to a combination of independent auditory and visual recalibrations, or to a unified audiovisual recalibration? Recently, Van der Burg et al. [44] investigated this by applying a simultaneity-judgement task across all combinations of visual, auditory and tactile modalities. They observed a strong inter-trial recalibration effect when participants performed an audiovisual task (a beep in combination with a flash), but not when the same visual or auditory event was presented in combination with a tactile pulse in a visuotactile or audiotactile task. The authors concluded that rapid recalibration is uniquely audiovisual: if recalibration were driven by separate, purely visual or auditory components, then it should also be observed in visuotactile and audiotactile tasks.

In simultaneity-judgement tasks, it is difficult to determine whether a shift in the psychometric function reflects a genuine change in the experienced temporal relationship of the stimuli, or rather a change in the criteria used to judge whether or not a particular temporal relationship constitutes synchrony [45]. The absence of audiotactile and visuotactile inter-trial recalibration effects [44], however, supports the notion that rapid recalibration is not due to a shift in criterion, or response bias. That is, if response bias underpinned recalibration, one should expect the same effect when participants perform an audiotactile or visuotactile task. Furthermore, in the present study, recalibration was more consistent when vision led audition on the previous trial than when audition led vision (figure 4c; see also [28]). A response bias derived from the modality order of the previous trial should yield a symmetrical pattern of results. We thus contend that the recalibration effects reported here are unlikely to be due to response bias.

Multisensory interactions depend on both lower-level factors, such as temporal proximity between events in different modalities [7,8,34,35], and higher-level factors, such as the unity assumption [36–38]. In this study, we find compelling evidence that rapid recalibration still occurs when the audiovisual speech stimuli obviously belong to different actors (i.e. the unity assumption is violated). Furthermore, we find that recalibration occurs even following cross-modal delays (800 ms) that almost never (1.9% of trials) produce an experience of simultaneity (see also [28]).

We note that the current study cannot rule out the existence of subtle congruency effects that modulate the exact magnitude of recalibration. Rather, it serves as a basic demonstration that rapid recalibration to asynchronous audiovisual speech stimuli is driven primarily by temporal factors, and occurs even when higher-order principles, such as the unity assumption, are violated.

Although our results are unambiguous, they stand in apparent opposition to results from at least one previous study of temporal recalibration. Roseboom & Arnold [21] found that observers could maintain concurrent recalibrations, which differed in the direction in which the PSS was shifted, for two different actors. Their study, however, used a conventional, prolonged adaptation procedure. This suggests that rapid and prolonged recalibration may be mediated by different mechanisms; or that rapid recalibration drives prolonged recalibration, but higher-order associations are possible only for the latter. In the above study, each actor was consistently paired with a spatial location during adaptation phases; but during isolated test trials, the PSS was shifted in a direction consistent with adaptation for the particular actor, regardless of presentation location. One possibility, then, is that the identity-specific recalibration they observed developed over many instances of location-specific rapid recalibration, which transferred across spatial locations after it became associated with the higher-order factor of actor identity.

While the transfer we observed across different actors may not be adaptive in some cases, transfer of adaptation across different consecutive speech stimuli is a useful device for improving comprehension in verbal communication. Rapidly adapting to a cross-modal delay that is detected during the first syllable of an utterance is likely to enhance the intelligibility of the remainder of the speech stream [6].

The asymmetrical adaptation we observed in Experiment 3 (see figure 4c) is also highly consistent with a mechanism whose purpose is to realign auditory and visual signals originating from the same source [28]. The difference in perceptual latency between synchronized (at origin) auditory and visual signals arises from two primary factors: the different air propagation speeds of sound (approx. 340 m s⁻¹) and light (approx. 3.0 × 10⁸ m s⁻¹); and the slower (by approx. 30 ms) process of neural transduction in vision compared to audition [9]. In addition, with increasing distance from the source, the intensity of an auditory stimulus usually decreases more rapidly than the intensity of a visual stimulus. For short distances, over which the difference in speed of propagation and stimulus intensity has little effect, audition is perceived before vision; however, the difference in perceptual latency is never greater than what is caused by neural transduction. Beyond about 10 m, at which point the difference in propagation speed balances the difference in neural transduction time, vision is perceived before audition; in this case, the difference in perceptual latency can continue to increase with distance indefinitely. There is thus a marked asymmetry in what we might term ecologically plausible differences in audiovisual timing: technological manipulations notwithstanding, audition can lead vision by only fractions of a second, yet considerably longer delays are possible when vision leads audition. Accordingly, we observed no effect on the PSS for trial t when audition led by more than 100 ms on trial t − 1, but a significant effect on PSS for trial t when vision led by up to 800 ms (the largest asynchrony tested). Curiously, past studies using prolonged adaptation procedures have found clear effects of adaptation to stimuli in which audition led by 200–300 ms, for simple tone–flash pairs [18,19] as well as for more complex speech stimuli [21]. Perhaps a small rapid recalibration effect—too small to be detected in the present experiment—occurs at these SOAs, and prolonged procedures allow detection of the cumulative effect over many trials.
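As a back-of-envelope check on the 10 m crossover cited above (our arithmetic, not a derivation reported in the paper), write the net auditory lag relative to vision at viewing distance d as Δt(d), treating light's travel time as negligible:

```latex
\Delta t(d) = \frac{d}{v_{\mathrm{sound}}} - \tau, \qquad
\Delta t(d^{*}) = 0 \;\Rightarrow\;
d^{*} = v_{\mathrm{sound}}\,\tau
      \approx 340~\mathrm{m\,s^{-1}} \times 0.030~\mathrm{s}
      \approx 10~\mathrm{m},
```

where τ ≈ 30 ms is vision's extra neural transduction time. For d < d*, Δt is negative and audition is perceived first, bounded in magnitude by τ; beyond d*, vision is perceived first, with no upper bound on the lag.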

We conclude that temporal recalibration to asynchronous audiovisual speech stimuli occurs rapidly. It appears to be induced only by stimuli with ecologically plausible differences in audiovisual timing, and may serve to rapidly realign auditory and visual signals to maximize speech comprehension. It generalizes across different stimuli and actors, and can be induced by auditory and visual events that are perceived to be asynchronous, suggesting that it is driven primarily by low-level temporal characteristics rather than higher-order cognitive factors.


Data accessibility. The full dataset is available online from the Open Science Framework (https://osf.io/yxt4z/).

Acknowledgements. We thank the actors (David, Sam and Rosalind) for their contributions.

Funding statement. This research was supported by a grant from the Australian Research Council to E.V.d.B. (Discovery Early Career Research Award, DE130101663). P.T.G. was supported by a grant from the John Templeton Foundation to H. Price, A. Holcombe, K. Miller and D. Rickles.

References

1. Alais D, Burr D. 2004 The ventriloquism effect results from near-optimal bimodal integration. Curr. Biol. 14, 257–262. (doi:10.1016/j.cub.2004.01.029)
2. Van der Burg E, Olivers CNL, Bronkhorst AW, Theeuwes J. 2008 Pip and pop: non-spatial auditory signals improve spatial visual search. J. Exp. Psychol. Hum. Percept. Perform. 34, 1053–1065. (doi:10.1037/0096-1523.34.5.1053)
3. Meredith MA, Stein BE. 1983 Interactions among converging sensory inputs in the superior colliculus. Science 221, 389–391. (doi:10.1126/science.6867718)
4. McGurk H, MacDonald J. 1976 Hearing lips and seeing voices. Nature 264, 746–748. (doi:10.1038/264746a0)
5. Morein-Zamir S, Soto-Faraco S, Kingstone A. 2003 Auditory capture of vision: examining temporal ventriloquism. Cogn. Brain Res. 17, 154–163. (doi:10.1016/S0926-6410(03)00089-2)
6. Sumby WH, Pollack I. 1954 Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 26, 212–215. (doi:10.1121/1.1907309)
7. Van der Burg E, Cass J, Olivers CNL, Theeuwes J, Alais D. 2010 Efficient visual search from synchronized auditory signals requires transient audiovisual events. PLoS ONE 5, e10664. (doi:10.1371/journal.pone.0010664)
8. Slutsky DA, Recanzone GH. 2001 Temporal and spatial dependency of the ventriloquism effect. NeuroReport 12, 7–10. (doi:10.1097/00001756-200101220-00009)
9. Alais D, Carlile S. 2005 Synchronising to real events: subjective audiovisual alignment scales with perceived auditory depth and speed of sound. Proc. Natl Acad. Sci. USA 102, 2244–2247. (doi:10.1073/pnas.0407034102)
10. Posner MI. 1980 Orienting of attention. Q. J. Exp. Psychol. 32, 3–25. (doi:10.1080/00335558008248231)
11. Van der Burg E, Olivers CNL, Bronkhorst AW, Theeuwes J. 2008 Audiovisual events capture attention: evidence from temporal order judgments. J. Vis. 8, 2. (doi:10.1167/8.5.2)
12. Spence C, Driver J. 1996 Audiovisual links in endogenous covert spatial attention. J. Exp. Psychol. Hum. Percept. Perform. 22, 1005–1030. (doi:10.1037/0096-1523.22.4.1005)
13. Spence C, Driver J. 1997 Audiovisual links in exogenous covert spatial orienting. Percept. Psychophys. 59, 1–22. (doi:10.3758/BF03206843)
14. McGill WJ. 1963 Stochastic latency mechanisms. In Handbook of mathematical psychology, vol. 1 (eds RD Luce, RR Bush, E Galanter), pp. 309–360. New York, NY: Wiley.
15. Los SA, Hoorn JF, Grin M, Van der Burg E. 2013 The time course of temporal preparation in an applied setting: a study of gaming behavior. Acta Psychol. 144, 499–505. (doi:10.1016/j.actpsy.2013.09.003)
16. Los SA, Van der Burg E. 2013 Sound speeds vision through preparation, not integration. J. Exp. Psychol. Hum. Percept. Perform. 36, 1612–1624. (doi:10.1037/a0032183)
17. Nickerson RS. 1973 Intersensory facilitation of reaction time: energy summation or preparation enhancement? Psychol. Rev. 80, 489–509. (doi:10.1037/h0035437)
18. Fujisaki W, Shimojo S, Kashino M, Nishida S. 2004 Recalibration of audiovisual simultaneity. Nat. Neurosci. 7, 773–778. (doi:10.1038/nn1268)
19. Vroomen J, Keetels M, De Gelder B, Bertelson P. 2004 Recalibration of temporal order perception by exposure to audio-visual asynchrony. Cogn. Brain Res. 22, 32–35. (doi:10.1016/j.cogbrainres.2004.07.003)
20. Machulla T, Di Luca M, Froehlich E, Ernst MO. 2012 Multisensory simultaneity recalibration: storage of the aftereffect in the absence of counterevidence. Exp. Brain Res. 217, 89–97. (doi:10.1007/s00221-011-2976-5)
21. Roseboom W, Arnold D. 2011 Twice upon a time: multiple concurrent temporal recalibrations of audiovisual speech. Psychol. Sci. 22, 872–877. (doi:10.1177/0956797611413293)
22. Navarra J, Hartcher-O'Brien J, Piazza E, Spence C. 2009 Adaptation to audiovisual asynchrony modulates the speeded detection of sound. Proc. Natl Acad. Sci. USA 106, 9169–9173. (doi:10.1073/pnas.0810486106)
23. Heron J, Roach NW, Hanson JVM, McGraw PV, Whitaker D. 2012 Audiovisual time perception is spatially specific. Exp. Brain Res. 218, 477–485. (doi:10.1007/s00221-012-3038-3)
24. Roseboom W, Kawabe T, Nishida S. 2013 Audiovisual temporal recalibration can be constrained by content cues regardless of spatial overlap. Front. Psychol. 4, 189. (doi:10.3389/fpsyg.2013.00189)
25. Navarra J, Garcia-Morera J, Spence C. 2012 Temporal adaptation to audiovisual asynchrony generalizes across different sound frequencies. Front. Psychol. 3, 152. (doi:10.3389/fpsyg.2012.00152)
26. Harrar V, Harris LR. 2008 The effect of exposure to asynchronous audio, visual, and tactile stimulus combinations on the perception of simultaneity. Exp. Brain Res. 186, 517–524. (doi:10.1007/s00221-007-1253-0)
27. Vroomen J, Van Linden S, De Gelder B, Bertelson P. 2007 Visual recalibration and selective adaptation in auditory–visual speech perception: contrasting build-up courses. Neuropsychologia 45, 572–577. (doi:10.1016/j.neuropsychologia.2006.01.031)
28. Van der Burg E, Alais D, Cass J. 2013 Rapid recalibration to asynchronous audiovisual stimuli. J. Neurosci. 33, 14633–14637. (doi:10.1523/JNEUROSCI.1182-13.2013)
29. Tuomainen J, Andersen TS, Tiippana K, Sams M. 2005 Audio-visual speech perception is special. Cognition 96, B13–B22. (doi:10.1016/j.cognition.2004.10.004)
30. Campbell R. 2008 The processing of audio-visual speech: empirical and neural bases. Phil. Trans. R. Soc. B 363, 1001–1010. (doi:10.1098/rstb.2007.2155)
31. Dixon NF, Spitz L. 1980 The detection of auditory visual desynchrony. Perception 9, 719–721. (doi:10.1068/p090719)
32. Harvey C, Van der Burg E, Alais D. 2014 Rapid temporal recalibration occurs crossmodally without stimulus specificity but is absent unimodally. Brain Res. 1585, 120–130. (doi:10.1016/j.brainres.2014.08.028)
33. Yarrow K, Roseboom W, Arnold D. 2011 Spatial grouping resolves ambiguity to drive temporal recalibration. J. Exp. Psychol. Hum. Percept. Perform. 37, 1657–1661. (doi:10.1037/a0024235)
34. Van der Burg E, Talsma D, Olivers CNL, Hickey C, Theeuwes J. 2011 Early multisensory interactions affect the competition among multiple visual objects. NeuroImage 55, 1208–1218. (doi:10.1016/j.neuroimage.2010.12.068)
35. Giard MH, Peronnet F. 1999 Auditory–visual integration during multimodal object recognition in humans: a behavioral and electrophysiological study. J. Cogn. Neurosci. 11, 473–490. (doi:10.1162/089892999563544)
36. Vatakis A, Spence C. 2007 Crossmodal binding: evaluating the 'unity assumption' using audiovisual speech stimuli. Percept. Psychophys. 69, 744–756. (doi:10.3758/BF03193776)
37. Van der Burg E, Brederoo SG, Nieuwenstein MR, Theeuwes J, Olivers CNL. 2010 Audiovisual semantic interference and attention: evidence from the attentional blink paradigm. Acta Psychol. 134, 198–205. (doi:10.1016/j.actpsy.2010.01.010)
38. Van Atteveldt NM, Formisano E, Goebel R, Blomert L. 2004 Integration of letters and speech in the human brain. Neuron 43, 271–282. (doi:10.1016/j.neuron.2004.06.025)
39. Welch RB, Warren DH. 1980 Immediate perceptual response to intersensory discrepancy. Psychol. Bull. 88, 638–667. (doi:10.1037/0033-2909.88.3.638)
40. Green KP, Kuhl PK, Meltzoff AN, Stevens EB. 1991 Integrating speech information across talkers, gender, and sensory modality: female faces and male voices in the McGurk effect. Percept. Psychophys. 50, 524–536. (doi:10.3758/BF03207536)
41. Navarra J, Vatakis A, Zampini M, Soto-Faraco S, Humphreys W, Spence C. 2005 Exposure to asynchronous audiovisual speech increases the temporal window for audiovisual integration of non-speech stimuli. Cogn. Brain Res. 25, 499–507. (doi:10.1016/j.cogbrainres.2005.07.009)
42. Vatakis A, Navarra J, Soto-Faraco S, Spence C. 2007 Temporal recalibration during asynchronous audiovisual speech perception. Exp. Brain Res. 181, 173–181. (doi:10.1007/s00221-007-0918-z)
43. Vatakis A, Navarra J, Soto-Faraco S, Spence C. 2008 Audiovisual temporal adaptation of speech: temporal order versus simultaneity judgments. Exp. Brain Res. 185, 521–529. (doi:10.1007/s00221-007-1168-9)
44. Van der Burg E, Orchard-Mills E, Alais D. 2015 Rapid temporal recalibration is unique to audiovisual stimuli. Exp. Brain Res. 233, 53–59. (doi:10.1007/s00221-014-4085-8)
45. Yarrow K, Jahn N, Durant S, Arnold DH. 2011 Shifts of criteria or neural timing? The assumptions underlying timing perception studies. Conscious. Cogn. 20, 1518–1531. (doi:10.1016/j.concog.2011.07.003)

