Fusion of spatially separated vowel formant cues

Marko Takanen, Tuomo Raitio, Olli Santala, Paavo Alku, and Ville Pulkki
Department of Signal Processing and Acoustics, Aalto University School of Electrical Engineering, P.O. Box 13000, FI-00076 Aalto, Finland

(Received 10 April 2013; revised 25 September 2013; accepted 7 October 2013)

Previous studies on fusion in speech perception have demonstrated the ability of the human auditory system to group separate components of speech-like sounds together and consequently to enable the identification of speech despite the spatial separation between the components. Typically, the spatial separation has been implemented using headphone reproduction where the different components evoke auditory images at different lateral positions. In the present study, a multichannel loudspeaker system was used to investigate whether the correct vowel is identified and whether two auditory events are perceived when a noise-excited vowel is divided into two components that are spatially separated. The two components consisted of the even and odd formants. Both the amount of spatial separation between the components and the directions of the components were varied. Neither the spatial separation nor the directions of the components affected the vowel identification. Interestingly, an additional auditory event not associated with any vowel was perceived at the same time when the components were presented symmetrically in front of the listener. In such scenarios, the vowel was perceived from the direction of the odd formant components. © 2013 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4826181]

PACS number(s): 43.71.Es, 43.66.Pn, 43.71.An [JMH]   Pages: 4508–4517

I. INTRODUCTION

The human auditory system forms a separate auditory stream for each sound source while analyzing the auditory scene of the surrounding space (Bregman, 1994; Griffins and Warren, 2004; Shinn-Cunningham, 2008). Due to the great importance of speech in human communication, several studies have addressed auditory scene analysis by studying the perception of speech as a fusion of separate components. It has been found that a three-tone sinusoidal replica of phonated speech following the center frequencies of the three lowest formants can be interpreted as speech (Remez et al., 1981) and that three frequency-modulated sinusoidal tones of equal amplitude positioned at the three first formants of a vowel are sufficient to identify the vowel (Lyzenga and Moore, 2005). Such results have led to the idea that the auditory system presumes a signal with any speech-like character to be speech (Cooke and Ellis, 2001).

Speech components are fused at a perceptual level when they are spatially separated (Cutting, 1976). In this article, the term spatially separated refers to scenarios where the two components are presented from different positions around the listener. The fusion of spatially separated speech components was studied first by Broadbent (1955) as well as by Broadbent and Ladefoged (1957). In their experiments, a synthetic speech stimulus was filtered into two components, one containing low frequencies and the other high frequencies. When the resulting components were simultaneously presented to different ears of the listener, the so-called spectral fusion (Cutting, 1976) occurred, and the subjects reported hearing only the original stimulus. However, the occurrence of spectral fusion requires the fundamental frequencies, the F0s, of the two components to be identical.


Broadbent and Ladefoged (1957) demonstrated that a difference of 25 Hz in the F0s inhibits spectral fusion even if the two components are simultaneously presented to the same ear. Later, Darwin and Hulkin (2004) demonstrated that the occurrence of spectral fusion also depends on the choice of the cutoff frequency of the filter used to separate the stimulus into two components.

The fusion of spatially separated components of speech can lead to the perception of two auditory events. This phenomenon, known as duplex perception (Whalen and Liberman, 1987), was first discovered by Rand (1974). He divided the speech stimulus, the syllable /da/, into two components, one containing only the formant transition of the second or the third formant in the beginning of the utterance and the other containing the remaining signal, denoted as the base of the utterance. When such components were dichotically presented to the ears of the listener, the subjects reported hearing the correct utterance /da/ in one ear and a secondary non-speech sound, called a chirp, in the other. This happened despite the fact that the base of the utterance was by itself not sufficient for the identification of the utterance (Liberman and Mattingly, 1989). Whether the utterance is correctly perceived and whether the chirp is heard depends on the stimulus onset asynchrony between the transition and the base of the utterance (Bentin and Mann, 1990), the amount of masking noise in the stimulus (Bentin and Mann, 1990), and the level difference between the transition and the base of the utterance (Rand, 1974; Liberman and Mattingly, 1989). Furthermore, Whalen and Liberman (1987) demonstrated that duplex perception can also occur when the two components are presented to the same ear of the listener, if the formant transition is replaced with a frequency-modulated sinusoidal signal following the change in the formant frequency.


Related to the detection of the chirp in the aforementioned experiments demonstrating duplex perception, some portion of the stimulus can be interpreted both as part of speech and as an additional sound event (Cooke and Ellis, 2001). For instance, when frequency modulation was introduced to a harmonic close to a formant in a synthetic vowel, the harmonic was perceived also as a separate auditory event while still contributing to the vowel perception (Gardner and Darwin, 1986).

The use of whispered speech may provide more knowledge about the perception of speech as a fusion of separate components. The production of whispered speech lacks the periodic glottal excitation, and therefore whispered sounds do not have the harmonic structure that is present, for example, in phonated vowels (Ito et al., 2005). In addition, whispered speech typically has more high-frequency energy than voiced utterances of phonated speech. Consequently, the energy of whispered speech is distributed more densely in frequency compared to phonated vowels, in which the main spectral content is located at the lowest harmonics. Despite the absence of cues based on fundamental frequency, listeners have been able to identify two simultaneously presented whispered vowels with about the same accuracy as simultaneously presented vowels sharing a common F0 (Scheffers, 1983). Culling and Summerfield (1995) generated synthetic vowels from bandpass-filtered noise samples, two for each vowel, and they studied the identification of a target vowel in the presence of one, two, or three masker vowels using headphone reproduction. They found that the target vowel is correctly identified when presented to the ear that is opposite to the one into which the masker is simultaneously presented. They also found that the target vowel is not correctly identified when interaural time difference is used to separate the target and the masker. Additionally, they found that the target vowel is again correctly identified when it is presented incoherently to the two ears of the listener, even in the presence of three diotically presented masker vowels.

The effects of spatially separated formant cues on the perception of speech were investigated in the present study. Unlike the previous investigations, a phonated vowel utterance was first divided into a glottal excitation and a vocal tract transfer function with a glottal inverse-filtering algorithm, and noise-excited counterparts for each vowel were generated. This procedure ensured that the stimuli preserved the natural characteristics of vowels but exhibited no periodic fundamental frequency cues. Even and odd resonances of the vocal tract were then separated and presented from different directions around the listener using a multichannel loudspeaker reproduction system in anechoic conditions. The selected reproduction method enabled accurate positioning of the separated formant cues both in front and at the side of the listener as well as ensured externalization of the perceived auditory events. Loudspeaker reproduction also has one additional benefit in comparison to headphone reproduction: the loudspeaker response at the eardrum is natural and free from any coloration artifacts that could be introduced in headphone reproduction. This unique procedure was employed to investigate the relative impacts of the monaural and binaural grouping cues on perception of speech as a fusion of separate components.

Here, the monaural grouping cues refer to the spectral contents of the spatially separated components, and the binaural ones to the directional cues evoked by the components. Based on the findings mentioned above, it is presumed that only one sound event is perceived and that the vowel is correctly identified when the two components are presented from the same direction. However, when the spatial separation between the components is increased, it may be that the fusion of the components no longer occurs and, consequently, the vowel is not correctly identified. Furthermore, it may happen that only one auditory event is perceived when the vowel is not correctly identified. Such a result would indicate that there are otherwise audible sounds in the scene that do not reach conscious perception (Shinn-Cunningham et al., 2007; Snyder et al., 2012). On the other hand, if the two components are both perceived as separate auditory events, the occurrence of spectral fusion does indeed require that the to-be-fused components either share a common F0 (Broadbent and Ladefoged, 1957) or that they are colocated (Culling and Summerfield, 1995). Alternatively, studying the fusion of formant cues in the proposed scenario might show that the vowel identification is not at all affected by the increased spatial separation between the components, and the only issue affected is the number of perceived auditory events. If the components are fully fused, only one auditory event is perceived. According to the "consistent-object" hypothesis (Schwartz and Shinn-Cunningham, 2010), the perceived direction of the vowel should then fall between the directions from which the two components are presented (Best et al., 2007). Then again, the study may indicate that duplex perception occurs, as one of the components contributes to the identification of the vowel and is at the same time perceived as an additional auditory event. Such a result would indicate that the directional cues of the two components have enough weight in auditory scene analysis to evoke the perception of the additional auditory event, but not to break the perception of the vowel identity. It is also possible that the weight of the directional cues depends on whether the cues point to opposing hemispheres or not.

II. EXPERIMENTS

In order to achieve the goals of the study, a stimulus generation method capable of separating even and odd formants was first adopted. The stimuli generated with this method were then used in a spatial sound reproduction system to investigate the speech perception of the test subjects.

A. Stimuli

The stimuli employed in the experiment consisted of incoherent samples of masking noise and sounds that were obtained by separating spoken Finnish vowels into two components, one including only the even formants and the other involving only the odd formants. The separation of the voiced utterances of all eight Finnish vowels (/a/, /e/, /i/, /o/, /u/, /y/, /æ/, and /œ/) into the two components was performed on the recordings of a male speaker, which were conducted in an anechoic chamber using a condenser microphone (B&K 4188).


The speech signals were sampled with a sampling frequency of 22.05 kHz, using a resolution of 16 bits. The sampling rate of the data was thereafter reduced to 16 kHz with an anti-aliasing lowpass filter in order to efficiently apply linear prediction (LP) based algorithms to create the stimuli. Each of the recorded vowels was processed with a glottal inverse-filtering algorithm, iterative adaptive inverse filtering (Alku, 1992; Alku et al., 1999), in order to separate the signals into the time-domain glottal source signal g(t) and the vocal tract transfer function Hv(z). The obtained transfer functions were later used to generate noise-excited counterparts for each vowel, but the resonances of the vocal tract transfer functions needed to be enhanced first. This was deemed necessary in order to improve vowel identification, since noise-excited vowels are known to exhibit weaker formant cues than phonated vowels (Katz and Assmann, 2001; Alku et al., 2001). Formant enhancement was conducted by expressing Hv(z) of each vowel with the line spectral pair decomposition (Soong and Juang, 1984) and using a technique proposed by Ling et al. (2006) with the value 0.4 for the enhancement parameter. The vocal tract filter following the formant enhancement is denoted below as Hsv(z). The magnitude responses shown in Fig. 1(a) illustrate that the formant enhancement effectively raised the formant amplitudes and attenuated the spectral valleys between the formants while keeping the formant center frequencies (i.e., the vowel identity) unchanged. Hence, the processing amplified the formant cues, making the noise-excited vowels sound clearer without affecting the perception of the vowel identity.
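The vowels were decomposed with the IAIF algorithm cited above; the sketch below only illustrates the underlying principle of LP-based inverse filtering with a single analysis pass, not the full iterative procedure. It uses a placeholder signal in place of a recorded vowel and a hypothetical `formant_enhance` helper to stand in for the LSP-based enhancement step.

```python
import numpy as np
from scipy.signal import lfilter

def lp_coefficients(x, order):
    """Autocorrelation-method linear prediction: solve the Yule-Walker equations
    directly and return the prediction-error filter A(z) = 1 - sum_k a_k z^-k."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

fs = 16000                                  # analysis sampling rate used in the study
vowel = np.random.randn(fs)                 # placeholder for a recorded vowel segment

# Single-pass approximation: estimate an all-pole vocal tract model and inverse
# filter the vowel with it to obtain a rough glottal source estimate g(t).
# IAIF additionally models and cancels the glottal contribution iteratively.
a_vt = lp_coefficients(vowel, order=20)     # vocal tract polynomial, Hv(z) = 1/A(z)
g = lfilter(a_vt, [1.0], vowel)             # inverse filtering -> glottal source estimate

# The study then sharpened the formants of Hv(z) through its line spectral pair
# representation (Ling et al., 2006; enhancement parameter 0.4); indicated here
# only as a hypothetical placeholder.
# a_vt_enhanced = formant_enhance(a_vt, alpha=0.4)
```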

Furthermore, the obtained glottal source signal g(t) was not used as an excitation signal to the corresponding vocal tract transfer function. Instead, aperiodic noise sequences were used in order to obtain stimuli with a denser spectral distribution (see Sec. I). However, the spectral envelope of the noise sequence was matched to that of the glottal source signal. More precisely, a second-order LP analysis was performed on the original glottal source signal g(t) estimated from the vowel /a/, and a set of incoherent white-noise signals were filtered with the same LP filter to obtain aperiodic noise sequences with the desired spectral envelope. A total of 38 incoherent white-noise signals were filtered with the LP filter. Sixteen of the resulting noise sequences were used to generate the even and odd formant components of the eight different vowels, and 22 were used as the masking noise in the experiment. An example of the magnitude response of an obtained noise sequence is presented in Fig. 1(b).

On completion of these preliminary processes, the different noise sequences were filtered with the Hsv(z) of the corresponding vowels to generate noise-excited counterparts of the original vowels. The noise-excited vowel signals lacked the sparse harmonics of the phonated vowels but shared the same overall spectral tilt modeled from the vowel /a/. In order to obtain the even and odd formant components, the noise-excited vowel signals were filtered with a set of bandpass filters. Each of the filters was a 180th-order window-based finite impulse response filter centered around a given formant of the noise-excited vowel. The formant frequencies were estimated by inspecting the magnitude response of Hsv(z) for each vowel. The obtained even and odd formant components of the noise-excited vowels are denoted as ŵ_even and ŵ_odd, respectively.
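The excitation generation described above can be sketched as follows: a second-order LP model is fitted to a glottal source estimate (a placeholder signal here), and incoherent white-noise sequences are filtered with the resulting all-pole filter so that they share its spectral tilt. The counts and signal lengths follow the text; everything else is an assumption.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
g = np.random.randn(5 * fs)                 # placeholder for the glottal source estimate of /a/

# Second-order LP fit (Yule-Walker) capturing the overall spectral tilt of g(t).
r = np.correlate(g, g, mode="full")[len(g) - 1:len(g) + 2]
a = np.linalg.solve([[r[0], r[1]], [r[1], r[0]]], r[1:3])
a_src = np.concatenate(([1.0], -a))         # prediction-error filter A_src(z)

n_sequences = 38                            # 16 for the vowel components, 22 for masking noise
rng = np.random.default_rng(0)

# Each white-noise sequence is filtered with 1/A_src(z); the sequences remain
# mutually incoherent but share the spectral envelope of the glottal source.
noise_excitations = [lfilter([1.0], a_src, rng.standard_normal(5 * fs))
                     for _ in range(n_sequences)]
```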

FIG. 1. Magnitude response of (a) the estimated vocal tract filter Hv of the vowel /æ/ and the same filter following the formant enhancement, (b) a noise excitation to the vocal tract filter, (c) the even and odd formant components of the vowel /æ/, and (d) the noise-excited vowel /æ/ and the signal obtained by summing the even and odd formant components of the noise-excited vowel. Similar samples of the noise excitation were used to generate the noise-excited vowels and used as the masking noise in the experiment. The plotted responses in (b), (c), and (d) are smoothed using a 1/16 octave bandwidth for visualization purposes.


The responses shown in Fig. 1(d) illustrate that the magnitude spectrum of the noise-excited vowel /æ/ is accurately preserved in the magnitude response of the signal that was obtained by summing the signals of the even and odd formant components. It should be noted that two incoherent noise-excited vowel signals were used to generate ŵ_even and ŵ_odd separately for each vowel, in order to ensure that the coherence between the two components cannot be used as a cue to group them together in the listening test. For each vowel, the odd formant components consisted of the first, third, fifth, and seventh formants (denoted respectively by F1, F3, F5, and F7), whereas the number of formants in the even formant components varied between the vowels. For vowels /a/, /i/, /o/, /æ/, and /œ/, the even formant components contained the second, fourth, and sixth formants (denoted by F2, F4, and F6, respectively). For vowels /e/, /u/, and /y/, the eighth formant (F8) was also included in the even formant components. The magnitude responses of the even and odd formant components of the vowel /æ/ are shown in Fig. 1(c).

All generated masking noise signals as well as the even and odd formant components of the different vowels were resampled at 48 kHz, which was used as the sampling frequency in the listening test. The generated signals had a length of 5 s, and randomly selected samples of those signals were employed in the different measurements of the experiment. Each sample was designed to have a 25-ms-long linear rise and decay.
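The even/odd split described above can be sketched as below: window-based FIR bandpass filters of the stated order are centered on estimated formant frequencies and their outputs are summed per component. The formant frequencies and bandwidths are illustrative placeholders, not the values estimated from Hsv(z) in the study.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 16000
numtaps = 181                               # 180th-order FIR filter, as stated above

def formant_component(noise_vowel, formant_freqs, bandwidth=400.0):
    """Sum of bandpass-filtered copies of a noise-excited vowel, one band per formant."""
    out = np.zeros_like(noise_vowel)
    for fc in formant_freqs:
        band = [max(fc - bandwidth / 2, 50.0), min(fc + bandwidth / 2, fs / 2 - 50.0)]
        h = firwin(numtaps, band, pass_zero=False, fs=fs)   # window-based bandpass FIR
        out += lfilter(h, [1.0], noise_vowel)
    return out

# Illustrative formant estimates for a noise-excited /ae/-like vowel (placeholders).
odd_formants = [600.0, 2400.0, 3900.0, 5600.0]      # F1, F3, F5, F7
even_formants = [1600.0, 3200.0, 4700.0]            # F2, F4, F6

# Two incoherent noise-excited vowel signals would be used here, one per component:
# w_odd = formant_component(noise_vowel_a, odd_formants)
# w_even = formant_component(noise_vowel_b, even_formants)
```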

B. Experimental setup

The listening test was conducted in a medium-sized anechoic room equipped with a multichannel sound reproduction system. The setup employed in the listening test consisted of 21 equidistant loudspeakers distributed on an arc in the horizontal directions ranging from −150° to 150° with a separation of 15°, and one in the direction of 180°, as depicted in Fig. 2. All of the loudspeakers were Genelec 8030A active monitors having a ±2 dB flat frequency response between 58 Hz and 20 kHz. Each loudspeaker was calibrated to reproduce an equal A-weighted sound pressure level (SPL) measured at the listening position with a tolerance of ±0.5 dB.

C. Test subjects

Eighteen voluntary native speakers of Finnish, all staff members of Aalto University and aged between 23 and 39 yr, participated in the test. All participants had hearing thresholds of less than 15 dB in the frequency range from 250 Hz to 8 kHz. Seventeen of the participants had previous experience of listening tests in general. The authors did not participate in the test.

FIG. 2. Loudspeaker setup employed in the listening test consisting of 22 equidistant loudspeakers distributed in the horizontal plane. All loudspeakers were used to reproduce the masking sound in the three parts of the primary test. The loudspeakers used to reproduce the two components of the target sound in two different test scenarios sharing the same value of 60° for the parameter separation are marked with different shades. The gray and black indicate the loudspeakers used when the value of the parameter location is either 0° or 90°, respectively.

D. Test procedure

The experiment consisted of a pretest and a primary test that was further divided into three separate parts, each addressing a different question. From the stimuli generated, the vowel /æ/ was selected for studying the fusion of spatially separated formant cues due to its suitable formant structure: the formants are evenly spaced, and the amplitudes of the even and odd formants are similar. Such a formant structure enables dividing the vowel into even and odd formant components such that neither of the two components is by itself sufficient for accurate identification of the vowel. A pilot test conducted prior to the experiment confirmed that only three of the noise-excited vowels (/e/, /æ/, and /œ/) could not be reliably identified when only one of the two components was involved. In addition, /æ/ was the only one of these three vowels that was accurately identified when both components were simultaneously presented. Hence, the rest of the vowel stimuli were not used in studying the fusion of the separated formant components per se, but they were used as additional stimuli in order to present listeners with all the Finnish vowels. None of the participants of the pilot test took part in the actual experiment.

Counterbalancing was used to control for sequential effects on the results. In other words, the 18 participants were randomly assigned to six groups, each of which did the three parts of the primary test in an order specified in Table I. The participants were allowed to take short breaks between the different parts of the experiment, and the participants completed the experiment in 39 min on average.

TABLE I. Order in which the six different participant groups did the three parts of the primary test. The primary test was always preceded by the pretest, which was also considered as a training session for the primary test.

Group    Test part 1    Test part 2    Test part 3
1        first          second         third
2        second         third          first
3        third          first          second
4        first          third          second
5        second         first          third
6        third          second         first


1. Pretest

The purpose of the pretest was threefold: to familiarize the participant with the stimuli used in the experiment, to prescreen participants who were unable to detect the vowels when both the even and odd formant components were presented from the same direction at the same time, and to study whether the participants were able to identify the vowel /æ/ based on only the even or odd formant components. Only the loudspeaker directly in front of the participant, i.e., at 0°, was used to emit sounds in the pretest, and all the stimuli used in the pretest were scaled to share the same root-mean-square value to minimize differences between the samples in the perceived loudness. The A-weighted SPL of the sound reproduction in the pretest was 58 dB.

The pretest began with a short training session, during which the participant was able to listen to 1-s-long monophonic samples of the eight different noise-excited vowels and the masking noise signal one at a time by selecting the corresponding sample using the graphical user interface (GUI). After listening to each sample at least once and deciding that he or she was familiar with the samples, the participant was able to proceed to the testing phase of the pretest. The testing phase consisted of 40 measurements. In each measurement, the participant heard a 1-s-long sample only once and was asked to identify the sample he or she heard as one of the eight possible Finnish vowels using the GUI. The participant was able to decline naming the signal if he or she was unable to identify the vowel. The sample presented contained either both the even and odd formant components of one of the eight noise-excited vowels or only the even or the odd formant components of the noise-excited vowel /æ/. Each of the resulting 10 options was measured four times, and the order in which the measurements were conducted was randomized for each participant, as sketched below.
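A minimal sketch of how the 40 pretest measurements (10 stimulus options, four repetitions each, in a per-participant random order) could be generated; the option labels are hypothetical.

```python
import random

# Eight noise-excited vowels plus the even-only and odd-only components of /ae/.
options = ([f"vowel_{v}" for v in ["a", "e", "i", "o", "u", "y", "ae", "oe"]]
           + ["ae_even_only", "ae_odd_only"])

def pretest_trials(participant_seed):
    """Four repetitions of each of the 10 options, shuffled separately per participant."""
    trials = options * 4                    # 40 measurements in total
    random.Random(participant_seed).shuffle(trials)
    return trials

print(pretest_trials(participant_seed=7))
```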

2. Primary test

The three parts of the primary test were designed to address three different questions: (1) Are the participants able to identify the vowel when the odd and even formant components are presented from different directions? (2) How many sounds do the participants hear in such scenarios? (3) From which direction do the participants perceive the vowel to be emitted in such scenarios?

A common set of test cases was employed in all three parts. In each test case, 22 loudspeakers (see Fig. 2) were used to emit incoherent 1-s-long samples of the masking noise [see Fig. 1(b)], creating a perception of a physical sound source distributed evenly around the listener. The overall A-weighted SPL of the masking noise at the listener position was 48 dB in all test cases. Depending on the test case, one or two of the loudspeakers emitting incoherent samples of the masking noise were also used to emit 800-ms-long samples of the even and odd formant components of the noise-excited vowel at different signal-to-noise ratios (SNRs). The parameter separation is used in this article to describe the angular separation between the two components of the noise-excited vowel as seen from the listening position, while the parameter location is used to define the direction around which the two components are symmetrically presented.


Five different values (0°, 30°, 60°, 90°, and 120°) were used for the parameter separation and two (0° and 90°) for the parameter location. The factorial combinations of the values of the two aforementioned parameters are referred to as test scenarios in this article (see Table II). In order to test whether the identification of the vowel and the number of sounds heard depend on the SNR, as has been found in experiments of duplex perception (Bentin and Mann, 1990), the two components of the target sound were presented at five different SNR levels (−17, −11, −5, 1, and 7 dB) in each test scenario (a sketch of the corresponding level scaling is given after Table II). Consequently, the total number of test cases was 50 in each of the three parts of the primary test.

In the first part of the primary test, the task of the participant was to identify the vowel(s) he or she heard in the presence of the masking noise. The participant was able to select either none, one, or several of the eight Finnish vowels displayed using the GUI. The goal of this part was to study whether the participants are able to correctly identify the vowel /æ/ when the even and odd formant components are presented from different directions. Each test case was repeated twice (i.e., presented three times) in the first part of the primary test, resulting in 150 measurements in total due to the factorial combinations of 3 (repetitions) × 5 (SNR levels) × 5 (spatial separation widths) × 2 (locations).

In the second part of the primary test, the participant was asked to report the number of sounds he or she heard in the presented stimuli in addition to the masking noise. The participant was instructed to count all kinds of additional sounds, whether they were speech-like or not. The participant was required to select from a choice of 0, 1, 2, and 3 in the GUI. The purpose of this part was to gain knowledge about the number of auditory events the participants perceive in the different test cases. Three consecutive measurements of each test case were conducted in the second part of the primary test as well, and hence this part of the primary test also had a total of 150 measurements.

TABLE II. Values of the parameters location and separation as well as the directions of the loudspeaker(s) used to reproduce the even and odd formant components of the noise-excited vowel in the 10 different test scenarios of the primary test. For the location parameter of 90°, the components were randomly presented either from the left or the right side (only positive directions are listed).

Scenario    Location    Separation    Directions
1           0°          0°            0°
2           0°          30°           ±15°
3           0°          60°           ±30°
4           0°          90°           ±45°
5           0°          120°          ±60°
6           90°         0°            90°
7           90°         30°           75° and 105°
8           90°         60°           60° and 120°
9           90°         90°           45° and 135°
10          90°         120°          30° and 150°
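The SNR manipulation described in the text above Table II can be sketched as follows; this is a simple broadband RMS-based gain computation and deliberately ignores the A-weighted SPL calibration and per-loudspeaker details of the actual setup. The signals below are placeholders.

```python
import numpy as np

def scale_to_snr(component, masking_noise, snr_db):
    """Scale a target component so that its RMS level relative to the masking
    noise corresponds to the requested SNR in decibels."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    gain = 10.0 ** (snr_db / 20.0) * rms(masking_noise) / rms(component)
    return gain * component

snr_levels_db = [-17, -11, -5, 1, 7]        # SNR levels used in each test scenario

# Example with placeholder signals (800-ms component, 1-s noise mix at 48 kHz).
rng = np.random.default_rng(0)
w_odd, noise_mix = rng.standard_normal(800 * 48), rng.standard_normal(1000 * 48)
w_odd_scaled = scale_to_snr(w_odd, noise_mix, snr_levels_db[0])
```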


In the third part of the primary test, the participant was asked to identify the loudspeaker(s) from which he or she perceived that the vowel(s) were emitted. The loudspeakers were identified via a GUI, where the options matched the loudspeaker layout employed in the test. The participant was able to mark no loudspeakers if he or she did not hear any vowels. The participant was instructed to remain still facing the loudspeaker at 0° whenever sounds were being played but was allowed to turn his or her head afterward in order to more accurately select the loudspeaker(s). The direction of each loudspeaker was also marked on the arc on which the loudspeakers were mounted. The aim of this part was to study from which direction the participants perceive the vowel being emitted when the even and odd formant components of the vowel are spatially separated. Only two runs of each test case were conducted in the third part of the primary test. Consequently, there were 100 measurements, resulting from the combinations of 2 (repetitions) × 5 (SNR levels) × 5 (spatial separation widths) × 2 (locations) in this part of the primary test.

Although the aim was to study the effect of the separation and location parameters on the fusion of the even and odd formant components only with the vowel /æ/, several precautions were taken to ensure that the participants would remain unaware of this aim. First of all, a random number of null measurements were also included in each of the three parts of the primary test. The corresponding test case in a null measurement was presented just as in an actual measurement, except that the two components of the noise-excited vowel /æ/ were replaced with either a single noise-excited vowel or a pair of noise-excited vowels. The single vowel was any one of the eight noise-excited vowels, whereas the vowel pair was {/a/, /i/}, {/e/, /y/}, {/o/, /œ/}, {/a/, /o/}, or {/æ/, /i/}. Furthermore, the probability of conducting a null measurement in a given run was 0.4. The second precaution consisted of randomizing the order of the test cases separately for each participant and for each part of the primary test. Third, the target sound components were presented either from the left or right side of the listener when the value of the location parameter was 90°. Finally, the decision of which of the two components of the target sound was presented from the loudspeaker further to the left was always made randomly.

E. Statistical analysis

The results of the pretest and the three parts of the primary test were analyzed with the analysis of variance (ANOVA) procedure. Two different regression models were employed in the analysis due to the difference in the nature of the response variable. The analysis of the results of the pretest and the first two parts of the primary test was designed to find whether there were significant differences in the detection rates between test cases. For this analysis, the number of correct detections made by the 18 participants during the given test was computed for each test case separately. Consequently, the response variables of the pretest and the first two parts of the primary test followed Poisson distributions, and generalized linear models were used as the regression models in the analysis of those tests.

It should be noted that the results of the second part of the primary test were not analyzed separately. Instead, they were combined with the results of the first part of the primary test, and in the data employed in the analysis, a correct detection corresponded to the simultaneous correct identification of the vowel /æ/ and perception of the presence of two sounds by the participant.

The analysis of the third part of the primary test was designed to discover the direction from which the participants perceived the vowel was emitted when the even and odd formant components were presented from different directions. For this analysis, the direction of the loudspeaker used to reproduce the odd formant components was stored for each run of the test. Then, two vectors n and m were constructed containing the values of the location parameter and the direction of the odd formant components, respectively. The values in m lay in the range from −60° to 60°, as the corresponding value of n was subtracted from the perceived direction in order to obtain comparable data for the scenarios where the value of the location parameter was either 0° or 90°. Furthermore, no segregation was made between measurements where both components of the target sound were presented from the left or the right side of the listener. In the analysis, odd and location were modeled as fixed variables and the participant was modeled as a random variable. Such a linear mixed-effects model was selected for the analysis due to its robustness (Jacqmin-Gadda et al., 2007) and due to the unbalanced nature of the data, a consequence of the participants not being required to indicate any directions if they did not hear any vowels.
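The analyses described above could be set up as sketched below with the statsmodels package, using synthetic stand-in data and assumed column names; this is an illustration of a Poisson generalized linear model and a linear mixed-effects model, not the authors' actual analysis code.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Detection data: one row per test case with a count of correct detections over
# participants (synthetic counts; column names are assumptions).
cases = pd.DataFrame(
    [(snr, sep, loc) for snr in (-17, -11, -5, 1, 7)
                     for sep in (0, 30, 60, 90, 120)
                     for loc in (0, 90)],
    columns=["snr", "separation", "location"])
cases["n_correct"] = rng.integers(0, 37, len(cases))

poisson_fit = smf.glm("n_correct ~ C(snr) + C(separation) + C(location)",
                      data=cases, family=sm.families.Poisson()).fit()
print(poisson_fit.summary())

# Localization data: perceived vowel direction per response, with the direction of
# the odd formant components and the location as fixed effects and the participant
# as a random effect, mirroring the linear mixed-effects analysis.
responses = pd.DataFrame({
    "perceived_direction": rng.normal(0.0, 20.0, 180),
    "odd_direction": rng.choice([-60, -30, 0, 30, 60], 180),
    "location": rng.choice([0, 90], 180),
    "participant": rng.integers(1, 19, 180)})

mixed_fit = smf.mixedlm("perceived_direction ~ C(odd_direction) * C(location)",
                        data=responses, groups=responses["participant"]).fit()
print(mixed_fit.summary())
```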

III. RESULTS

A. Pretest

The results of the pretest were analyzed to find out how accurately the participants were able to identify the vowel when either both the even and odd formant components of a given noise-excited vowel or only one of the components of the vowel /æ/ were presented. An additional interest was to discover whether the sample, that is, either one of the eight noise-excited vowels (/a/, /e/, /i/, /o/, /u/, /y/, /æ/, or /œ/) or the even or the odd formant components of the vowel /æ/, had a significant effect on the recognition rate. A significant effect was found [χ² = 117.3, d.f. = 9, p < 0.001]. This effect is illustrated by the average recognition rates and the 95% confidence intervals (CIs) plotted in Fig. 3. The accuracy of the participants was not significantly above the chance level of 12.5% when only the even or the odd formant components were presented, whereas the recognition rate was above 95% when both the even and odd formant components of the vowel /æ/ were presented at the same time. Furthermore, the average recognition rate for the eight different noise-excited vowels was above 58% for each vowel. The results were similar between the different participants, and therefore there was no need to exclude any participants from the primary test.

B. Primary test

The first part of the primary test concentrated on whether the subjects heard the vowel /æ/ when the even and odd formant components were presented from different directions.


FIG. 3. Average recognition rates in the 10 different scenarios of the pretest. The dashed line represents the chance level of 12.5%. The error bars present the 95% confidence intervals.

Only the SNR was found to have a significant effect on the detection rate of the vowel [χ² = 725.3, d.f. = 4, p < 0.001]. Hence, neither the separation nor the location parameter affected the vowel recognition. Moreover, the average detection rates and the 95% CIs for the 10 test scenarios at different SNR levels plotted in Fig. 4 show that a recognition rate of 68% was exceeded in all 10 test scenarios at the two highest SNRs, although neither the even nor the odd formant components alone were found to be sufficient for the accurate identification of the vowel in the pretest (see Fig. 3).

Another analysis was performed by inspecting the detection rates for identifying the vowel /æ/ correctly while perceiving the presence of two sounds at the same time. The analysis revealed that the SNR [χ² = 127.3, d.f. = 4, p < 0.001], the parameters separation [χ² = 115.2, d.f. = 4, p < 0.001] and location [χ² = 293.1, d.f. = 1, p < 0.001], as well as the interactions between SNR and location [χ² = 74.4, d.f. = 4, p < 0.001] and between separation and location [χ² = 36.7, d.f. = 4, p < 0.001], had a significant effect on the detection rate. The average detection rates and the 95% CIs illustrated in Fig. 5 reveal that the detection rate increased significantly when the SNR level and the spatial separation between the two components were increased, but only when the two components were presented symmetrically from the front, i.e., when the value of the location parameter was 0°.

The results for the third part of the primary test were analyzed to discover whether the direction of the odd formant components of the vowel /æ/ and the parameter location had an effect on the perceived direction of the vowel. The results indicated that both the direction of the odd formant components [F(8,124) = 58.58, p < 0.001] and the value of the parameter location [F(1,9) = 41.80, p < 0.001] as well as the interaction between the two [F(8,119) = 72.88, p < 0.001] had a significant effect on the perceived direction of the vowel. The marginal means and the 95% CIs shown in Fig. 6(a) illustrate that the perceived direction of the vowel linearly follows the direction from which the odd formant components were presented when both components were presented from the front of the listener, i.e., when the value of the parameter location was 0°.


FIG. 4. Average recognition rates of the vowel /æ/ in the 10 different scenarios listed in Table II as a function of the SNR. The error bars present the 95% confidence intervals.

On the other hand, the results shown in Fig. 6(b) illustrate that the vowel was perceived to be emitted from the frontmost loudspeaker when the two components were presented symmetrically from the same side of the listener.

IV. DISCUSSION

Speech perception under spatially separated formant cues was investigated in the present study. Natural Finnish vowels were first divided into glottal source signals and vocal tract transfer functions using a glottal inverse-filtering algorithm. Noise-excited counterparts for the eight different vowels were generated from incoherent white-noise sequences using filters determined from the obtained glottal source signals and vocal tract transfer functions. The generated noise-excited vowels were thereafter divided into their even and odd formant components.

FIG. 5. Average detection rates for identifying the vowel /æ/ correctly while perceiving the presence of two sounds at the same time. The recognition rates are illustrated for the 10 different scenarios listed in Table II as a function of the SNR. The error bars present the 95% confidence intervals.

In the listening experiments, the two components of the vowel /æ/ were presented from different directions around the listener in anechoic conditions using a multichannel loudspeaker setup to reproduce the components as well as the incoherent samples of the masking noise. Both the amount of spatial separation between the even and odd formant components and the directions of the components were varied.

The first goal of the experiments was to study whether the correct vowel is perceived when the even and odd formant components are spatially separated. The results of the first part of the primary test showed that with the three highest SNRs the vowel /æ/ was correctly identified only when the even and odd formant components were simultaneously presented, demonstrating spectral fusion (Cutting, 1976) of the two components. The extent of spatial separation between the two components was not found to affect the recognition rate of the vowel. No differences were found between the test scenarios where the two components were presented symmetrically, either in the front or the side of the listener.

FIG. 6. Average results for the perceived direction of the vowel /æ/ as a function of the direction from which the odd formant components were presented. The dotted and dashed lines illustrate the direction of the odd and even formant components, respectively. The results for the scenarios in which the even and odd formant components were presented from the front are shown in (a), whereas (b) shows the results for scenarios where both components were presented from the side. The offset of 90° is included in the values plotted in (b) for interpretation purposes. The error bars present the 95% confidence intervals.

Additionally, a similar recognition rate of higher than 70% was achieved in all test scenarios with the highest SNR level. The reported ability of the participants to identify the vowel equally well in all test scenarios is in line with previous studies on spectral fusion (Cutting, 1976) and duplex perception (Whalen and Liberman, 1987), where subjects have been shown to be able to identify the speech stimulus even when the two components of the stimulus were presented to different ears over headphones.

The second goal of the study was to find out whether the spatial separation of the even and odd formants would result in the perception of two auditory events. Therefore, an analysis was performed to discover whether a secondary auditory event was perceived at the same time when the vowel was correctly identified. Such a phenomenon was found to occur only when the odd and even formant components of the vowel /æ/ were spatially separated and presented symmetrically in front of the listener.


The observed phenomenon is similar to duplex perception (Whalen and Liberman, 1987). Furthermore, an increase in the spatial separation between the components was found to result in an increase in the detection rate, which is understandable considering that the ability to segregate concurrent sound sources improves when the spatial separation between them is increased (Blauert, 1997). However, only one auditory event corresponding to the vowel was perceived when the two components were presented symmetrically at the side of the listener.

The vowel was found to be perceived from the direction of the odd formant components when the two components were presented symmetrically from the front. This may be explained by the larger amount of low-frequency energy in the odd formant components, an observation that was further confirmed by an informal listening test in which the odd formant components were perceived to sound more vowel-like. However, the result cannot be generalized to cover all vowels, as only the vowel /æ/ was used to study this aspect in the test. When the two components were presented with the axis of symmetry along the side of the listener, the perceived direction of the vowel was found to follow the direction of the frontmost component and to lie between the directions of the two components, which is in accordance with the "consistent-object" hypothesis (Schwartz and Shinn-Cunningham, 2010; Best et al., 2007).

Inspection of the responses provided by the participants additionally revealed that only one loudspeaker was perceived to be emitting vowel sounds when the even and odd formant components were presented simultaneously in the different test cases, even when the components were presented from different sides of the midline. Consequently, the other auditory event that the subjects perceived besides the vowel in such scenarios was not associated with any vowel. The perceived direction of this additional auditory image was not addressed in the present study. However, since the perceived direction of the vowel in such scenarios was found to follow the direction from which the odd formant components were presented, it can be speculated that the even formant components of the vowel /æ/ contributed to the perception of the vowel and also evoked the perception of an additional auditory image.

When considering the neurophysiological knowledge of the human auditory pathway, the spatial information of different sound events is thought to be analyzed in the where processing stream of auditory cortical processing, while the sound spectrum is thought to be analyzed in the what processing stream (Rauschecker and Tian, 2000). Furthermore, measurements of spatial receptive fields in free-field conditions have provided supporting evidence for the opponent/hemifield coding of horizontal sound source location. More specifically, most neurons are sensitive to a wide range of locations limited to a single hemisphere (Leiman and Hafter, 1972; Aitkin et al., 1984; Groh et al., 2003; Stecker et al., 2005). Hence, the observations in the present study lead to the idea that the vowel is identified correctly due to the fusion of the two components of the vowel in the what stream, and that the additional auditory image is perceived only if the two components evoke conflicting directional cues in opposite hemispheres in the where processing stream.


For instance, Belin et al. (2000) have found evidence for voice-selective areas in the human auditory cortex, and there is also evidence for vowel-identity-specific cortical areas (Mäkelä et al., 2003; Shestakova et al., 2004; Obleser et al., 2006). Moreover, certain cortical areas have been found to be specifically recruited in localization tasks (Zatorre et al., 2002; Deouell et al., 2007). Consequently, a potential sequel to the present investigation could be a study in which brain imaging is used to analyze whether the fusion of the components and the perceived directions of the auditory images are visible in the responses arising from the left and right auditory cortices.

V. CONCLUSIONS

The study shows that spectral fusion occurs in vowel perception when the even and odd formant components are simultaneously presented. Consequently, the vowel is correctly identified regardless of the spatial separation between the two components, although neither of the components is by itself sufficient for accurate identification of the vowel. Moreover, it was found that only a single auditory event corresponding to the vowel is perceived when both components are presented symmetrically on the side of the listener, whereas an additional auditory event may be perceived when the two components are presented symmetrically from the front of the listener. Furthermore, the probability that both the vowel and the additional auditory event are perceived at the same time rises when the spatial separation between the two components increases. The study also showed that the vowel is perceived from the direction of one of the components, and that the other component therefore contributes to the vowel identity as well as evokes the perception of an additional auditory event.

In conclusion, the study supports the interpretation that the processing streams of the auditory pathway are fused for the identification of the vowel, but two auditory images are perceived when there are conflicting directional cues in the where processing streams. The result gives more insight into auditory scene analysis and spatial hearing in general, and into how directional and spectral information can be combined in binaural auditory models.

ACKNOWLEDGMENTS

This work has been supported by the Academy of Finland and the Walter Ahlström foundation. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC Grant agreement No. 240453 and Grant agreement No. 287678.

Aitkin, L. M., Gates, G. R., and Phillips, S. C. (1984). "Responses of neurons in inferior colliculus to variations in sound-source azimuth," J. Neurophysiol. 52, 1–17.
Alku, P. (1992). "Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering," Speech Commun. 11, 109–118.
Alku, P., Sivonen, P., Palomäki, K., and Tiitinen, H. (2001). "The periodic structure of vowel sounds is reflected in human electromagnetic brain response," Neurosci. Lett. 298, 25–28.

Alku, P., Tiitinen, H., and Näätänen, R. (1999). "A method for generating natural-sounding speech stimuli for cognitive brain research," J. Clin. Neurophysiol. 110, 1329–1333.
Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., and Pike, B. (2000). "Voice-selective areas in human auditory cortex," Nature 403, 309–312.
Bentin, S., and Mann, V. (1990). "Masking and stimulus intensity effects on duplex perception: A confirmation of the dissociation between speech and nonspeech modes," J. Acoust. Soc. Am. 88, 64–74.
Best, V., Gallun, F. J., Carlile, S., and Shinn-Cunningham, B. (2007). "Binaural interference and auditory grouping," J. Acoust. Soc. Am. 121, 1070–1076.
Blauert, J. (1997). Spatial Hearing. The Psychophysics of Human Sound Localization, 2nd ed. (MIT Press, Cambridge, MA), pp. 37–50, 257–271.
Bregman, A. S. (1994). Auditory Scene Analysis: The Perceptual Organization of Sound (MIT Press, Cambridge, MA), pp. 47–394.
Broadbent, D. E. (1955). "A note on binaural fusion," Q. J. Exp. Psychol. 7, 46–47.
Broadbent, D. E., and Ladefoged, P. (1957). "On the fusion of sounds reaching different sense organs," J. Acoust. Soc. Am. 29, 708–710.
Cooke, M., and Ellis, D. P. W. (2001). "The auditory organization of speech and other sources in listeners and computational models," Speech Commun. 35, 141–177.
Culling, J. F., and Summerfield, Q. (1995). "Perceptual segregation of concurrent speech sounds: absence of across-frequency grouping by common interaural delay," J. Acoust. Soc. Am. 98, 785–797.
Cutting, J. E. (1976). "Auditory and linguistic processes in speech perception: Interferences from six fusions in dichotic listening," Psychoacoust. Rev. 83, 114–140.
Darwin, C. J., and Hulkin, R. W. (2004). "Limits to the role of a common fundamental frequency in the fusion of two sounds with different spatial cues," J. Acoust. Soc. Am. 116, 502–506.
Deouell, L. Y., Heller, A. S., Malach, R., D'Esposito, M., and Knight, R. T. (2007). "Cerebral responses to change in spatial location of unattended sounds," Neuron 55, 985–996.
Gardner, R. B., and Darwin, C. (1986). "Grouping of vowel harmonics by frequency modulation: Absence of effects on phonemic categorization," Percept. Psychophys. 40, 183–187.
Griffins, T. D., and Warren, J. D. (2004). "What is an auditory object," Nat. Rev. Neurosci. 5, 887–892.
Groh, J. M., Kelly, K. A., and Underhill, A. M. (2003). "A monotonic code for sound azimuth in primate inferior colliculus," J. Cogn. Neurosci. 15, 1217–1231.
Ito, T., Takeda, K., and Itakura, F. (2005). "Analysis and recognition of whispered speech," Speech Commun. 45, 139–152.
Jacqmin-Gadda, H., Sibillot, S., Proust, C., Molina, J.-M., and Thiebaut, R. (2007). "Robustness of the linear mixed model to misspecified error distribution," Comput. Stat. Data Anal. 51, 5142–5154.
Katz, W. F., and Assmann, P. F. (2001). "Identification of children's and adults' vowels: intrinsic fundamental frequency, fundamental frequency dynamics, and presence of voicing," J. Phonetics 29, 23–51.


Leiman, A. L., and Hafter, E. R. (1972). "Responses of inferior colliculus neurons to free field auditory stimuli," Exp. Neurol. 35, 431–449.
Liberman, A. M., and Mattingly, I. G. (1989). "A specialization for speech perception," Science 243, 489–494.
Ling, Z.-H., Wu, Y.-J., Wang, Y.-P., Qin, L., and Wang, R.-H. (2006). "USTC system for Blizzard Challenge 2006: an improved HMM-based speech synthesis method," in Proc. of the Blizzard Challenge Workshop, Pittsburgh, PA.
Lyzenga, J., and Moore, B. C. J. (2005). "Effect of frequency-modulation coherence for inharmonic stimuli: Frequency-modulation phase discrimination and identification of artificial double vowels," J. Acoust. Soc. Am. 117, 1314–1325.
Mäkelä, A. M., Alku, P., and Tiitinen, H. (2003). "The auditory N1m reveals the left hemispheric representation of vowel identity in humans," Neurosci. Lett. 353, 111–114.
Obleser, J., Boecker, H., Dzerga, A., Haslinger, B., Hennenlotter, A., Roettinger, M., Eulitz, C., and Rauschecker, J. P. (2006). "Vowel sound extraction in anterior superior temporal cortex," Hum. Brain Mapp. 27, 562–571.
Rand, T. C. (1974). "Dichotic release from masking for speech," J. Acoust. Soc. Am. 55, 678–680.
Rauschecker, J. P., and Tian, B. (2000). "Mechanisms and streams for processing of 'what' and 'where' in auditory cortex," Proc. Natl. Acad. Sci. U.S.A. 97, 11800–11806.
Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. (1981). "Speech perception without traditional speech cues," Science 212, 947–950.
Scheffers, M. T. M. (1983). "Sifting vowels. Auditory pitch analysis and sound segregation," Ph.D. thesis, University of Groningen.
Schwartz, A. H., and Shinn-Cunningham, B. G. (2010). "Dissociation of perceptual judgments of what and where in an ambiguous auditory scene," J. Acoust. Soc. Am. 128, 3041–3051.
Shestakova, A., Brattico, E., Soloviev, A., Klucharev, V., and Huotilainen, M. (2004). "Orderly cortical representation of vowel categories presented by multiple exemplars," Cogn. Brain Res. 21, 342–350.
Shinn-Cunningham, B. G. (2008). "Object-based auditory and visual attention," Trends Cogn. Sci. 12, 182–186.
Shinn-Cunningham, B. G., Lee, A. K. C., and Oxenham, A. J. (2007). "A sound element gets lost in perceptual competition," Proc. Natl. Acad. Sci. U.S.A. 104, 12223–12227.
Snyder, J. S., Gregg, M. K., Weintraub, D. M., and Alain, C. (2012). "Attention, awareness, and the perception of auditory scenes," Front. Psychol. 3, 1–15.
Soong, F. K., and Juang, B.-H. (1984). "Line spectrum pair (LSP) and speech data compression," in Proc. of the IEEE Intl. Conf. on ICASSP'84, San Diego, CA, pp. 37–40.
Stecker, G. C., Harrington, I. A., and Middlebrooks, J. C. (2005). "Location coding by opponent neural populations in the auditory cortex," PLoS Biol. 3, 520–528.
Whalen, D. H., and Liberman, A. M. (1987). "Speech perception takes precedence over non-speech perception," Science 237, 169–171.
Zatorre, R. J., Bouffard, M., Ahad, P., and Belin, P. (2002). "Where is 'where' in the human auditory cortex?," Nat. Neurosci. 5, 905–909.

