Perceptual & Motor Skills: Perception 2013, 117, 3, 903-912. © Perceptual & Motor Skills 2013

PROCESSING FEMALE AND MALE VOICES: A WORD SPOTTING EXPERIMENT¹

ERWAN PÉPIOT
University Paris 8

Summary. Several previous studies showed that synthetic vowel identification is more difficult for voices with a high f0 (the lowest frequency that defines voice pitch), but it is not clear whether this means that female voices, which generally have a higher f0, are processed more slowly than male voices. A word spotting experiment was conducted with 25 French native listeners (8 men, 17 women; M age = 27.6 yr., SD = 10.8). Words produced by four male and four female speakers were played to the participants. Their task was to press a button every time they identified the target word “étage.” Response times were collected and compared in four different conditions: male voice preceded by male voices, female voice preceded by female voices, male voice preceded by female voices, and female voice preceded by male voices. Results showed that both sexes' voices were processed equally fast. Moreover, no significant correlation was found between mean f0 of the target word and response time. Nevertheless, when a target word produced by a male speaker occurred after several words produced by a female speaker (or vice versa), the listener's RT decreased, suggesting that male and female voices are processed as two different entities.

¹ Address correspondence to Erwan Pépiot, 80 rue du 4 septembre, 78800 Houilles, France or e-mail ([email protected]).

DOI 10.2466/24.27.PMS.117x31z7
ISSN 0031-5125

Female voices have long been neglected by phoneticians, especially in vocalic formant studies (Ferragne & Pellegrino, 2010). This could be partly explained by the fact that females' formants are generally harder to locate in spectra and spectrograms, due to a higher f0 (the lowest frequency that defines the pitch of the voice), implying a lower spectral density (Takefuta, Jancosek, & Brunt, 1972; Boë, Contini, & Rakotofiringa, 1975). Moreover, a number of studies have brought to light other cross-sex acoustic differences, such as vowel formants that tend to be located at higher frequencies in female speakers (Hillenbrand, Getty, Clark, & Wheeler, 1995; Whiteside, 1998b). The spectral characteristics of consonants also differ as a function of the speaker's sex (Whiteside, 1998a), as do voice onset time, which corresponds to the period between the release of a stop consonant and the onset of voicing (Swartz, 1992; Pépiot, in press), phonation type (Klatt & Klatt, 1990; Pépiot, in press), and speech rate (Byrd, 1994). This raises the question of whether these acoustic differences affect speech processing in listeners.

Sokhi, Hunter, Wilkinson, and Woodruff (2005) and Lattner, Meyer, and Friederici (2005) used fMRI to study brain activation during the processing of sentences produced by female and male speakers. They found
that these two types of voices generate different brain responses: some areas were more activated by female voices and others by male voices. These results suggest that male and female voices are processed differently by listeners. Nonetheless, in a more recent study, Charest, Pernet, Latinus, Crabbe, and Belin (2012) did not find differences in brain response for these two types of voices when processing monosyllabic words. Moreover, it was shown that listeners adjust their phonemic boundaries as a function of the speaker's sex, expecting higher resonant frequencies in female voices (Johnson, Strand, & D'Imperio, 1999; Munson, 2011).

Concerning processing difficulty, a study by Ryalls and Lieberman (1982) played isolated synthetic vowels to listeners, who had to identify them. Vowels were played at different f0 levels: 100, 135, or 250 Hz. In all cases, vowels with f0 at 100 or 135 Hz were identified more accurately than those at 250 Hz. According to the authors, a high spectral density (i.e., a low f0) facilitates formant detection and consequently vowel identification. A similar study conducted by Diehl, Lindblom, Hoemeke, and Fahey (1996) reached the same conclusion. Considering these findings, one could hypothesize that speech processing time is related to the f0 of the voice being processed; thus it should be longer for female voices.

In a study on American English listeners (Strand, 2000), isolated words produced by male or female speakers were presented to the participants, who had to repeat the word as fast as they could (a speeded naming task). Response times (RT) were measured and compared in four conditions: stereotypical male voice,² stereotypical female voice, non-stereotypical (i.e., ambiguous regarding sex) male voice, and non-stereotypical female voice. No significant difference in RT was found between stereotypical male and female voices (631.30 msec. and 631.26 msec., respectively).
² The speakers' voices were previously evaluated by 24 listeners to establish which ones were more or less stereotypic of the speaker's sex.

Unsurprisingly, ambiguous voices (both male and female) produced significantly higher RTs. Such results suggest that female voices are not processed more slowly than male voices. However, only one stereotypical male voice and one stereotypical female voice were used in Strand's (2000) experiment, which strongly limits the scope of that study. Moreover, the speeded naming paradigm of that study means that the RTs reflect both speech perception time and speech production time. A word spotting experiment, which does not involve speech production, appears more appropriate. Therefore, an experiment was conducted to test the following two hypotheses.

Hypothesis 1. Female and male voices are processed differently by listeners. Therefore, RT for processing a female voice will differ
depending on whether it is heard within a context of male voices or female voices (and vice versa for processing a male voice).

Hypothesis 2. Female voices are processed more slowly than male voices, due to their higher f0. Therefore, words produced by female speakers will yield higher RTs than those produced by male speakers (provided that other variables are kept constant).

METHOD

Participants

Twenty-five listeners took part in the experiment: 8 men and 17 women. Their age ranged from 18 to 65 years (M = 27.6, SD = 10.8; males M = 36.1, SD = 14; females M = 23.6, SD = 6.1). All participants were native speakers of Parisian French with no reported speech or hearing disorder. They were recruited at Paris 8 University and received a USB memory stick for their participation in the experiment. Moreover, they were informed that the data from the experiment would be treated with confidentiality.

Stimuli

The word spotting paradigm (Marslen-Wilson & Tyler, 1980) is an auditory perception task that consists of detecting a target word embedded in a sentence or in a series of separated words. The participant's task is to push a button as fast as possible each time he/she identifies this word. For the current experiment, frequent disyllabic words with neutral emotional content were used. Sixty-one French words were selected, including one target word. The word étage (meaning “floor”) was chosen as the target because of its initial vowel [e], which allows large cross-sex formant differences. See Appendix for stimulus list.

The words were recorded from 8 Parisian French speakers (4 women, 4 men; ages 20 to 34 yr., M = 25.1, SD = 5.2). They were all non-smokers and had no reported speech disorders. Recordings took place in a quiet room, with a digital recorder. In order to hold prosodic parameters constant, each word was framed in a sentence: “Il a dit ____ deux fois” (“He said ____ twice”). Following the recording, the words were extracted from this context.
Procedure

A word spotting experiment requires the control of particular variables, and the neutralization of several potential biases. Four experimental conditions were used:

Male homogeneous (condition A): male voices context before the target word produced by a male voice.
Female homogeneous (condition B): female voices context before the target word produced by a female voice.
Male non-homogeneous (condition C): female voices context before the target word produced by a male voice.
Female non-homogeneous (condition D): male voices context before the target word produced by a female voice.

In order to maximize its effect, the context extended not only to the four non-target words played before the target word in the experimental series of words, but also to the pre-experimental series (which is three to four words long, including the target word). Distractor series of variable lengths, also ending with the target word (on which RT was not taken into account), were also used. Thus, a basic scheme was repeated: two distractor series were presented, then one pre-experimental series and one experimental series. Examples are provided in Table 1.

TABLE 1
Extract from the experimental plan. The left column indicates the series type (top-down chronological order). The sex of the speaker who produced each word (F for female, M for male) is given in parentheses. Every series ends with the target word étage.

| Series Type | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 |
|---|---|---|---|---|---|---|
| Distractor | silence (F) | étage (M) | | | | |
| Distractor | pourquoi (F) | tourner (M) | compter (F) | instant (M) | abeille (M) | étage (F) |
| Pre-experimental (Cond. A) | tirer (M) | moment (M) | étage (M) | | | |
| Experimental (Cond. A) | maison (M) | facile (M) | cadeau (M) | chemin (M) | étage (M) | |
| Distractor | humain (M) | étage (F) | | | | |
| Distractor | quitter (M) | sentir (F) | matin (F) | argent (M) | étage (F) | |
| Pre-experimental (Cond. C) | debout (F) | passer (F) | étage (F) | | | |
| Experimental (Cond. C) | depuis (F) | action (F) | savoir (F) | creuser (F) | étage (M) | |
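The four conditions amount to pairs of context sex and target sex. The following is a minimal sketch of that mapping, not the software used in the study: the `Condition` class and the `experimental_series` helper are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One experimental condition (hypothetical helper, not from the paper)."""
    label: str
    context_sex: str  # sex of the speakers heard before the target ("M" or "F")
    target_sex: str   # sex of the speaker who produces the target word

    @property
    def homogeneous(self) -> bool:
        # Homogeneous means context and target come from same-sex speakers.
        return self.context_sex == self.target_sex


CONDITIONS = {
    "A": Condition("Male homogeneous", "M", "M"),
    "B": Condition("Female homogeneous", "F", "F"),
    "C": Condition("Male non-homogeneous", "F", "M"),
    "D": Condition("Female non-homogeneous", "M", "F"),
}


def experimental_series(cond, context_words, target="étage"):
    """Pair each context word with the context sex, then append the target."""
    series = [(word, cond.context_sex) for word in context_words]
    series.append((target, cond.target_sex))
    return series
```

Calling `experimental_series(CONDITIONS["C"], ["depuis", "action", "savoir", "creuser"])` reproduces the sex pattern of the experimental series for condition C in Table 1: four female-voice context words followed by a male-voice target.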

The order of the four conditions within an experimental unit was counterbalanced. Unit 1 included the conditions in the following order (with the letters A, B, C, and D representing the conditions): ABCD (i.e., two distractor series, a pre-experimental series, the experimental series for condition A, two distractor series, a pre-experimental series, the experimental series for condition B, etc.). Units 2–4 were similar to Unit 1, except for the order of the conditions, which was BADC, DCBA, and CDBA, respectively. Each condition was therefore tested four times during the experiment, in all possible orders (1, 2, 3, or 4).

A single target word (étage) was used during the experiment, in order to limit some biases that may arise with multiple target words (e.g., differences in RTs due only to the acoustic structure of the different words). Consequently, another point had to be considered: since participants had to repeatedly detect the same word, it is likely that they would improve their RT as they progressed in the experiment. To compensate for this possible bias, the four units were played in the order 1, 2, 3, 4 to half of the participants, and in the order 3, 4, 1, 2 to the other half. In addition to this precaution, a statistical check was performed a posteriori.

Every non-target word appeared once in each unit, always in a different order. The distribution of the eight different recorded voices was made according to several rules, in order to limit biases as much as possible. During the entire experiment, each voice appeared twice as an experimental target word. Moreover, each voice could not appear more than twice in the same series of words and never twice in a row.

The experiment was conducted on a computer, using the software Perceval 3.0.5.0 (André, Ghio, Cavé, & Teston, 2003) and a button box. Participants sat in front of the computer and wore headphones. The instructions (i.e., the task) were presented on the screen. Their task was to push a button as fast as they could each time they identified the word étage. Initially, six warm-up series were played, followed by the four units of the experiment. No visual stimulus was displayed on the screen at any time. Audio stimuli (i.e., the words) were played with an inter-stimulus interval (ISI) of 600 msec. The volume of the audio presentation was kept constant.
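The counterbalancing scheme can be checked mechanically by listing the serial position each condition occupies across the four units. The sketch below is illustrative only; the function names are hypothetical. Applied to the orders as printed (ABCD, BADC, DCBA, CDBA), it shows that C and D occupy every position once, whereas A and B each miss one position. Whether this reflects the actual design or a slip in the printed orders cannot be determined from the text. The alternative fourth order CDAB used in the usage note is an assumption, not something the source reports.

```python
from collections import defaultdict

# Condition orders of Units 1-4, exactly as printed in the text.
UNIT_ORDERS = ["ABCD", "BADC", "DCBA", "CDBA"]


def positions_by_condition(orders):
    """Map each condition letter to the sorted serial positions it occupies."""
    positions = defaultdict(list)
    for order in orders:
        for index, cond in enumerate(order, start=1):
            positions[cond].append(index)
    return {cond: sorted(pos) for cond, pos in positions.items()}


def is_latin_square(orders):
    """True if every condition occupies every serial position exactly once."""
    full = list(range(1, len(orders) + 1))
    return all(pos == full for pos in positions_by_condition(orders).values())
```

With the printed orders, `positions_by_condition(UNIT_ORDERS)` gives positions 1, 2, 4, 4 for A and 1, 2, 3, 3 for B. Replacing the fourth order with the hypothetical `"CDAB"` would make `is_latin_square` return `True`.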
Measures and Analyses

Reaction times (measured from the onset of the target word until the participant pressed the button) were recorded automatically and saved in a text file by Perceval. Sixty-four RTs per participant were collected (all occurrences of the target word in the experiment, including those in the distractor, pre-experimental, and experimental series). However, only those from the experimental series were included in the analysis. Thus, 16 measurements were taken per participant (4 in each experimental condition). The total number of RTs included in the analysis was 400 (4 RTs per condition × 4 conditions × 25 participants). Mean RTs were compared with ANOVAs. Correlations between RT and several variables (word length, f0, or learning) were computed to test for potential effects.

RESULTS

No data were lost. All RTs fell within a reasonable range, from 262 to 717 msec. (i.e., none was abnormally long or short).
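The RT counts given under Measures and Analyses follow directly from the design: each unit tests four conditions, and each condition contributes two distractor series, one pre-experimental series, and one experimental series, all ending with the target word. A short arithmetic sketch (variable names are illustrative; the six warm-up series are assumed not to be counted, which is what the reported total implies):

```python
UNITS = 4
CONDITIONS_PER_UNIT = 4
DISTRACTOR_SERIES = 2        # per condition block
PRE_EXPERIMENTAL_SERIES = 1  # per condition block
EXPERIMENTAL_SERIES = 1      # per condition block
PARTICIPANTS = 25

# Every series ends with one occurrence of the target word.
series_per_block = DISTRACTOR_SERIES + PRE_EXPERIMENTAL_SERIES + EXPERIMENTAL_SERIES

# All target occurrences heard by one participant (warm-up excluded by assumption).
targets_per_participant = UNITS * CONDITIONS_PER_UNIT * series_per_block

# Only targets closing an experimental series enter the analysis.
analysed_per_participant = UNITS * CONDITIONS_PER_UNIT * EXPERIMENTAL_SERIES
analysed_per_condition = analysed_per_participant // CONDITIONS_PER_UNIT

total_analysed = analysed_per_participant * PARTICIPANTS
```

This yields 64 collected RTs per participant, 16 analysed RTs per participant (4 per condition), and 400 analysed RTs overall, matching the figures reported in the text.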

A two-factor ANOVA (listener's sex × experimental condition) revealed no interaction between these factors (F(3, 392) = 0.30, p = .83). This suggests that relative differences in RT between the four conditions did not vary as a function of the listener's sex. Thus, the analysis of RTs was conducted on the entire group of listeners, with no sex distinction.

The target word's duration, as well as its fundamental frequency, varied slightly from one speaker to another. A detailed description of each stimulus (duration, mean f0, and f0 range³) is presented in Table 2. To ensure that these variations did not affect the results, Pearson correlations were calculated between the RTs and the target word's duration, mean f0, and f0 range. Low and non-significant correlations were found: r(6) = .21, z = 0.47, p = .64; r(6) = –.37, z = –0.87, p = .39; and r(6) = –.24, z = 0.55, p = .58 for duration, mean f0, and f0 range, respectively. Thus, stimulus duration and f0 do not seem to have influenced participants' RT.

TABLE 2
Mean f0, f0 range, and duration of 8 occurrences of the target word étage pronounced by 4 female and 4 male speakers

| Speaker | Mean f0 of the Target Word (Hz) | f0 Range of the Target Word (Hz) | Duration of the Target Word (msec.) |
|---|---|---|---|
| F1 | 239 | 49 | 510 |
| F2 | 187 | 71 | 605 |
| F3 | 202 | 47 | 607 |
| F4 | 226 | 78 | 540 |
| M1 | 127 | 49 | 475 |
| M2 | 135 | 39 | 500 |
| M3 | 150 | 61 | 542 |
| M4 | 143 | 73 | 558 |
³ Range of f0 corresponds to the difference between the highest and the lowest frequency reached within the word.

As previously mentioned, the use of the same target word throughout the experiment could have caused a progressive shortening of RT (i.e., a learning effect). A Pearson correlation was computed between RT and the order of occurrence of the target word. No significant correlation between these two factors was found: r(398) = –.02, z = –0.37, p = .71. The repetition of the same target word (étage) did not entail a significant decrease in RT.

Mean RTs for the detection of the target word in the four experimental conditions are presented in Table 3.

TABLE 3
Listeners' mean response time (msec.) and standard deviation (msec.) as a function of experimental condition, with post hoc comparisons between conditions

| Condition | Response Time M (msec.) | SD (msec.) |
|---|---|---|
| Male homogeneous (A) | 502 | 103 |
| Female homogeneous (B) | 495 | 107 |
| Male non-homogeneous (C) | 478 | 94 |
| Female non-homogeneous (D) | 474 | 105 |

Post hoc comparisons: A > C, B > D

A two-factor ANOVA, with homogeneity (homogeneous, non-homogeneous) and speaker's sex (female, male) as factors, was
conducted on listeners' RTs. There was a significant overall effect of homogeneity (F(1, 396) = 4.57, p = .03) with a large effect size (f = 0.56). RTs for both female and male voices were significantly shorter in non-homogeneous conditions (C and D), in which words before the target word were pronounced by speakers of the opposite sex, thereby supporting Hypothesis 1.

Overall, mean response time was 484 msec. (SD = 106; SE = 8) for female voices (conditions B and D), and 490 msec. (SD = 99; SE = 7) for male voices (conditions A and C). The ANOVA showed that there was no significant effect of the speaker's sex (F(1, 396) = 0.29, p = .59). Thus, Hypothesis 2 was not supported: overall, there was no significant difference between RTs for female and male voices.

DISCUSSION

This word spotting experiment yielded two main observations. First, there was no significant difference between the processing times of words produced by female and male speakers. Thus, isolated words produced by female and male voices seem to be processed at the same speed by listeners. This was true independent of the listener's sex. Additionally, no significant correlation was found between the mean f0 of the target word and RT, suggesting that regardless of the speaker's sex, a word produced with a high f0 would not take longer to process than a word with a low f0.

These results are partially consistent with previous studies, but differ from them in several respects. Ryalls and Lieberman (1982), as well as Diehl, et al. (1996), showed that identification of isolated synthesized vowels was more difficult when they had a high f0. This could be interpreted as female voices being more difficult to process auditorily. However, these authors measured a percentage of correct identifications, not a response time. Moreover, they used synthetic stimuli: (un)naturalness might have strongly influenced the results, especially when vowels were synthesized with a very high f0. Finally, the linguistic unit they used, an isolated vowel, might have seemed artificial: apart from experimental conditions, listeners rarely have to identify such a short unit presented out of context.
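The marginal means quoted in the Results can be re-derived from the cell means in Table 3, and an F ratio can be converted to Cohen's f through partial eta squared, f = sqrt(F × df_effect / df_error). The sketch below is illustrative only and its function names are hypothetical. Applied to the reported F(1, 396) = 4.57, this conversion gives f of about 0.11, which is smaller than the reported f = 0.56. The source does not state how its effect size was computed, so the two figures may rest on different definitions.

```python
import math

# Cell means from Table 3 (msec.); each cell holds the same number of RTs.
CELL_MEANS = {"A": 502, "B": 495, "C": 478, "D": 474}


def marginal_mean(conditions):
    """Unweighted mean of the listed cell means (valid because cells have equal n)."""
    return sum(CELL_MEANS[c] for c in conditions) / len(conditions)


def cohens_f_from_F(F, df_effect, df_error):
    """Cohen's f implied by an F ratio, via partial eta squared.

    eta_p^2 = F*df_effect / (F*df_effect + df_error) and f = sqrt(eta_p^2 / (1 - eta_p^2)),
    which simplifies to sqrt(F * df_effect / df_error).
    """
    return math.sqrt(F * df_effect / df_error)


female_mean = marginal_mean("BD")           # female target voices
male_mean = marginal_mean("AC")             # male target voices
homogeneous_mean = marginal_mean("AB")      # same-sex context
non_homogeneous_mean = marginal_mean("CD")  # opposite-sex context
```

The marginal means agree with the text to within rounding (about 484 msec. for female voices and 490 msec. for male voices). The homogeneity contrast is roughly 22 msec. (498.5 vs. 476 msec.).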

Therefore, it is plausible that listeners develop similar processing abilities for male and female voices, at least in the case of isolated word stimuli, because they have been accustomed to these two types of voices since the beginning of their lives. Thus, even if vowels with high f0 were more difficult to process, a claim that was not strictly established, listeners could compensate with consonants, which are known to be particularly decisive in lexical access (Owren & Cardillo, 2006).

Strand (2000) used isolated words to measure participants' RT as a function of the type of voice (male or female), which was also done in the current study. Nonetheless, that experiment consisted of a speeded naming task, which involves both perception and production. Despite this methodological difference, results in Strand (2000) are quite similar to those obtained here: no significant RT difference was found between female and male voices. Moreover, in that study, only one male and one female ‘stereotypical’ voice were used; it was necessary to confirm these tendencies by using a greater number of voices.

The second main observation concerns the differences found between homogeneous and non-homogeneous conditions. Non-homogeneous conditions (C and D) entailed lower RTs than homogeneous ones (A and B). In other words, when the context was made up of words produced by female voices before a target word produced by a male speaker (or vice versa), the listeners' RTs decreased. This is probably due to increased attention, caused by a perceived change of paradigm (pop-out effect), which could be due merely to the switch of the speaker's sex. It might suggest that listeners process female and male voices as two different entities. These results are consistent with those obtained by Sokhi, et al. (2005) and Lattner, et al. (2005), showing that female and male voices activate different areas of the listeners' brain, and with those from Johnson, et al. (1999), showing that listeners adjust their phonemic boundaries as a function of the speaker's sex.

Considering the small number of RTs taken into account in the present study (4 per condition for each participant), a larger study could be done in order to confirm these results. Furthermore, this study was conducted with small and isolated linguistic units, which might limit its scope. Further research on female and male speech processing might be conducted with larger units, such as sentences or discourse, using a different experimental paradigm.

REFERENCES

ANDRÉ, C., GHIO, A., CAVÉ, C., & TESTON, B. (2003) Perceval: a computer-driven system for experimentation on auditory and visual perception. Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, Spain. Pp. 1421-1424.
BOË, L-J., CONTINI, M., & RAKOTOFIRINGA, H. (1975) Étude statistique de la fréquence laryngienne. Phonetica, 32, 1-23.
BYRD, D. (1994) Relations of sex and dialect to reduction. Speech Communication, 15, 39-54.
CHAREST, I., PERNET, C., LATINUS, M., CRABBE, F., & BELIN, P. (2012) Cerebral processing of voice sex studied using a continuous carryover fMRI design. Cerebral Cortex, 23, 958-966.
DIEHL, R. L., LINDBLOM, B., HOEMEKE, K. A., & FAHEY, R. P. (1996) On explaining certain male-female differences in the phonetic realization of vowel categories. Journal of Phonetics, 24, 187-208.
FERRAGNE, E., & PELLEGRINO, F. (2010) Formant frequencies of vowels in 13 accents of the British Isles. Journal of the International Phonetic Association, 40, 1-34.
HILLENBRAND, J., GETTY, L. A., CLARK, M. J., & WHEELER, K. (1995) Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97, 3099-3111.
JOHNSON, K., STRAND, E., & D'IMPERIO, M. (1999) Auditory-visual integration of talker sex in vowel perception. Journal of Phonetics, 27, 359-384.
KLATT, D. H., & KLATT, L. C. (1990) Analysis, synthesis and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87, 820-857.
LATTNER, S., MEYER, M. E., & FRIEDERICI, A. D. (2005) Voice perception: sex, pitch, and the right hemisphere. Human Brain Mapping, 24, 11-20.
MARSLEN-WILSON, W., & TYLER, L. K. (1980) The temporal structure of spoken language understanding. Cognition, 8, 1-71.
MUNSON, B. (2011) The influence of actual and imputed talker sex on fricative perception, revisited (L). Journal of the Acoustical Society of America, 130, 2631-2634.
OWREN, M., & CARDILLO, G. (2006) The relative roles of vowels and consonants in discriminating talker identity vs word meaning. Journal of the Acoustical Society of America, 119, 1727-1739.
PÉPIOT, E. (in press) Voice, speech and gender: male-female acoustic differences and cross-language variation in English and French speakers. Actes des Rencontres Jeunes Chercheurs 2011 et 2012 de l'ED 268.
RYALLS, J. H., & LIEBERMAN, P. (1982) Fundamental frequency and vowel perception. Journal of the Acoustical Society of America, 72, 1631-1634.
SOKHI, D. S., HUNTER, M. D., WILKINSON, I. D., & WOODRUFF, P. W. (2005) Male and female voices activate distinct regions in the male brain. NeuroImage, 27, 572-578.
STRAND, E. (2000) Sex stereotype effects in speech processing. Unpublished doctoral dissertation, The Ohio State Univer.
SWARTZ, B. L. (1992) Sex difference in voice onset time. Perceptual & Motor Skills, 75, 983-992.
TAKEFUTA, Y., JANCOSEK, E. G., & BRUNT, M. (1972) A statistical analysis of melody curves in the intonation of American English. Proceedings of the 7th International Congress of Phonetic Sciences, Montreal, Canada. Pp. 1035-1039.
WHITESIDE, S. P. (1998a) Identification of a speaker's sex: a fricative study. Perceptual & Motor Skills, 86, 587-591.
WHITESIDE, S. P. (1998b) Identification of a speaker's sex: a study of vowels. Perceptual & Motor Skills, 86, 579-584.

Accepted November 27, 2013.

APPENDIX

LIST OF THE WORDS USED IN THE EXPERIMENT

Target word: étage.

Non-target words: abeille, action, ami, après, argent, article, avance, cadeau, cerveau, chanter, chemin, cheveu, compter, creuser, debout, depuis, dimanche, facile, façon, falloir, glaçon, humain, immense, instant, maison, marché, mari, marteau, matin, moitié, moment, moyen, objet, offrir, pareil, parent, partie, passer, photo, pourquoi, pouvoir, quitter, savoir, sentir, sérieux, service, silence, tableau, tirer, tourner, utile, visage, voiture.

Non-target words used in warm-up series: choisir, heureux, jamais, million, oser, oubli, plusieurs.

Copyright of Perceptual & Motor Skills is the property of Ammons Scientific, Ltd. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.
