Phonetica 31: 185-197 (1975)

Contributions of Fundamental Frequency and Formant Frequencies to Speaker Identification

Conrad LaRiviere
Speech Science Laboratory, University of Missouri - Kansas City, Kansas City, Mo.

Abstract. This experiment in aural speaker identification attempts to ascertain the relative contributions of fundamental frequency and formant frequencies. Speech samples were four voiced, whispered, and low-pass filtered isolated vowels produced by eight male speakers. Twelve listeners were asked to make speaker identification choices from these stimuli. Utterances were subjected to acoustic analyses, and the parameters extracted were related to speaker identification confusions by rank order correlation techniques. Results indicate that both fundamental frequency and formant frequencies contribute (approximately equally) to speaker identification judgments.

Introduction

Downloaded by: Monash University 130.194.20.173 - 9/16/2017 2:15:52 PM

Recently there has been considerable interest concerning speaker identification from spectrographic representations of voice (voiceprints). The forensic implications of such a technique are obvious, but there has been some skepticism regarding its validity [Bolt et al., 1970]. Prominent among the acoustic characteristics displayed on spectrograms are fundamental frequency and formant frequencies. If, as Tosi et al. [1972] suggest, spectrograms are to be used forensically to objectively display similarities among voices, it might be useful to have data bearing on the contributions of these two cues to speaker identification.

The existing literature on aural speaker identification has been concerned mainly with quantifying the extent of the effect. For example, Compton [1963] used only sustained productions of the vowel /i/ and varied duration and filtering conditions. In agreement with Pollack et al. [1957], Compton found that performance increased with duration only up to durations of 1,250 msec. Moreover, he found that high-pass filtering at 1,020 Hz substantially reduced speaker identification performance, and low-pass filtering at 1,020 Hz had no significant effect on performance. Compton concluded that confusion among speakers was largely explained by similarities of their fundamental frequencies, yet he reported no attempt to account for the confusion on any other basis.

Bricker and Pruzansky [1966] used a variety of speech materials as stimuli in a speaker identification task. They noted that confusion among speakers was not independent of the vowel uttered, i.e., the pattern of speaker confusion differed over vowels. Stevens et al. [1968] also noted vowel effects; specifically, they found that utterances containing a front vowel (/i/) yielded higher speaker identification scores than utterances containing a back vowel (/a/). They speculated that this latter result may have been due to the importance of the second formant as a cue to speaker identification, since front vowels are characterized by a wide frequency gap between the first and second formant and a high absolute frequency location of the second formant.

The present experiment attempted to ascertain the contributions of fundamental frequency and formant frequencies to aural speaker identification. Listeners were asked to identify speakers on the basis of (1) isolated voiced vowels, which of course contained both cues, (2) the same vowels low-pass filtered at 200 Hz - an attempt to simulate a fundamental-frequency-only condition, and (3) the same vowels, whispered - an attempt to simulate a formant-frequency-only condition. It should be noted that conditions 2 and 3 are only approximate simulations. Because articulatory configurations for each vowel can differentially affect laryngeal airflow, low-pass filtering at 200 Hz does not completely separate a source characteristic (fundamental frequency) from a transfer function characteristic (formant frequency). Whispered productions (condition 3) do have a source component, specifically, broad band frication. However, there is evidence [Meyer-Eppler, 1957] that such a source has little effect on the first two pole frequencies.
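
How far condition 2 suppresses formant information can be gauged from the filter slope reported under Procedure (23 dB/octave above a 200-Hz cutoff). The Python sketch below assumes an idealized filter with no attenuation at or below the cutoff and a constant slope above it. The function name and the example frequencies are illustrative assumptions, not values measured in this study.

```python
from math import log2

CUTOFF_HZ = 200.0             # low-pass cutoff reported in the Procedure
SLOPE_DB_PER_OCTAVE = 23.0    # attenuation rate reported in the Procedure


def attenuation_db(freq_hz, cutoff_hz=CUTOFF_HZ, slope=SLOPE_DB_PER_OCTAVE):
    """Idealized attenuation in dB: zero up to the cutoff, then a constant
    number of dB per octave above it (an assumption, not a measured response)."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    if freq_hz <= cutoff_hz:
        return 0.0
    octaves_above_cutoff = log2(freq_hz / cutoff_hz)
    return slope * octaves_above_cutoff


# Hypothetical component frequencies: one in the range of a male fundamental,
# the others in the range where first formants typically lie.
for freq in (120.0, 300.0, 400.0, 700.0, 800.0):
    print(f"{freq:6.0f} Hz: {attenuation_db(freq):5.1f} dB attenuation")
```

Under these assumptions a component one octave above the cutoff (400 Hz) is attenuated by 23 dB and one two octaves above (800 Hz) by 46 dB, so energy in the first-formant region is reduced rather than removed, which is consistent with describing condition 2 as an approximate simulation.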

Procedure

Subjects. The speakers in this experiment were eight males whose ages ranged from 22 to 35. They were free from any speech defect and spoke general American English. The listeners were twelve individuals who had been in routine contact with each of the speakers for a period of at least 6 months.

Materials and recording conditions. Because of the observations of Stevens et al. [1968] and Bricker and Pruzansky [1966], two front vowels, /i/ and /æ/, and two back vowels, /u/ and /a/, were chosen for speaker production. These particular vowels were chosen because they occupy extreme positions in the traditional vowel diagram; hence their formant frequencies were expected to show wide contrasts over a range of speakers. Each speaker was seated in a sound-treated room and positioned approximately 6 in from a high-quality dynamic microphone. All utterances were recorded at 7.5 in/sec on a single-track tape recorder located outside the room. The speakers were asked to produce the following utterances: (a) name; (b) five 3-sec productions of each vowel, whispered; and (c) five 3-sec productions of each vowel, voiced. Speakers were instructed to achieve a constant VU meter deflection (-2) for all of their utterances; performance was monitored by both speaker and experimenter. At the time of the recordings a black-and-white photograph was taken of each speaker against a flat background with a high-quality 35-mm camera. Enlargements (8 x 10 in) of these photographs were obtained for use in the listening sessions.

Utterance selection and treatment. The most phonetically representative production of each stimulus from the five original productions was selected by a panel of four experienced listeners. It was these ‘preferred’ productions which were treated for duration, filtered where appropriate, and repeated and randomized for the actual experimental tapes. One duration, 1,250 msec, was used for all stimulus materials. This duration was selected on the basis of the observations of Compton [1963] and Pollack et al. [1954] that durations exceeding this had no effect on identification performance. These excerpts were generated by an electronic switch and timer. Rise-fall times were set at 25 msec. To generate the filtered vowels, the 1,250-msec excerpts of the voiced vowels were low-pass filtered at 200 Hz. The frequency response of the filter showed an attenuation rate of 23 dB/octave. Three experimental tapes, one for each stimulus category (i.e., voiced, whispered, and filtered vowels), were generated.
On each tape all stimuli were repeated five times and randomized; the interstimulus interval was 5 sec. Playback time for the tapes was approximately 15 min each.

Playback conditions. All experimental tapes were played from a single-track tape recorder, through one channel of a preamplifier and power amplifier. The loudspeaker was located in a sound-treated room. Stimuli were presented to listeners at 70 dB SPL. One or two observers per listening session were used, seated equidistantly (3 ft) from the loudspeaker. A sound-level meter, positioned where listeners were to be seated, was used as a calibration device. Calibration was accomplished via a 1,000-Hz tone which was recorded at the same VU level (-2) as the speech samples. The RMS voltage at the loudspeaker’s input, corresponding to 70 dB SPL in the room, was noted on a vacuum tube voltmeter; the latter was monitored by the experimenter throughout the listening sessions. Portraits of each speaker were attached to the wall of the listening room. Immediately below each portrait, the initials of the individual portrayed were printed in large block letters. Listeners were provided with written instructions. The experimental tapes were presented in random order. All listening sessions were conducted over a period of 5 days. All listeners indicated their responses by circling the initials of the speaker that produced each item. Listener responses were transferred to confusion matrices for each type of utterance. The diagonals of these matrices represented correct listener responses and were used in all analyses of variance.

Acoustic analysis. Since one of the objectives was to assess to what extent speaker identification and confusion among speakers may be accounted for by fundamental frequency and formant frequencies, each speaker’s ‘preferred’ utterances were subjected to acoustical analyses for extraction of these characteristics. Fundamental frequency for all voiced stimuli was determined by oscillographic analysis (at 15 in/sec). Formant frequencies were estimated from wideband spectrograms generated by a spectrographic unit.

Analysis of confusions. For the voiced vowels an attempt was made to account for the observed confusion among speakers on the basis of rank order correlation techniques [Kendall’s tau; Siegel, 1956] relating confusions to the characteristics extracted from the acoustic analyses above. The rationale here was that if, for a given utterance type, speaker identification is largely coded in some acoustic characteristic, then that parameter should be a good predictor of actual confusion among speakers. Confusion was related to fundamental frequency (f0), first formant frequency (F1), second formant frequency (F2), third formant frequency (F3), and the ratio of formant two to formant one frequency (F2/F1). The confusion matrices themselves are shown in Appendix A.
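
The text does not spell out how confusions and acoustic parameters were paired for the rank order correlation. As one plausible reading, the Python sketch below ranks speaker pairs by their difference in a single parameter (here f0) and by how often the two speakers were confused, then computes Kendall's tau with a correction for ties (tau-b). All f0 values and confusion counts in the example are hypothetical, and the helper name is an assumption made for illustration.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall rank correlation with tie correction (tau-b)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied on both variables
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    # Pairs untied in x: concordant + discordant + tied_y_only (and vice versa).
    denom = sqrt((concordant + discordant + tied_y_only)
                 * (concordant + discordant + tied_x_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: mean f0 (Hz) for four speakers and the number of times
# each pair of speakers was confused (both directions summed).
f0 = {1: 110.0, 2: 125.0, 3: 98.0, 4: 140.0}
confusions = {(1, 2): 9, (1, 3): 11, (1, 4): 3, (2, 3): 4, (2, 4): 8, (3, 4): 1}

pairs = sorted(confusions)
f0_difference = [abs(f0[a] - f0[b]) for a, b in pairs]
confusion_count = [confusions[p] for p in pairs]

tau = kendall_tau_b(f0_difference, confusion_count)
print(f"tau-b between f0 difference and confusion count: {tau:.3f}")
```

If a parameter predicts confusion in the sense described above, speaker pairs that are close on that parameter should be confused more often, so tau computed this way would be strongly negative.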

Results and Discussion

Overall listener performance for vowel stimuli is shown in table I. The overall means associated with each type of stimulus in table I are 40% for voiced vowels, 22% for whispered vowels, and 21% for filtered vowels. It is clear that speaker identification can be achieved on the basis of fundamental frequency information (filtered vowels) and/or formant frequency information (whispered vowels). Furthermore, the relative contributions of these cues are approximately equal.

Voiced vowels. Since analysis of variance showed a significant vowel-speaker interaction for the voiced vowels (F(21,341) = 3.37, p [remainder of passage missing in source]

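[Table II: a posteriori comparisons among voiced vowels, by speaker. The table body is garbled in the source and the assignment of comparisons to speakers cannot be recovered. Legible entries: /i/ > /u/, /æ/ > /u/, /a/ > /u/, /æ/ > /i/, /a/ > /i/.]

1 Vowels to the left, at any given speaker, resulted in significantly better identification performance than vowels at the right.
2 No significant differences among voiced vowels were found at speakers 1 and 3.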

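Table III. A posteriori comparisons among whispered vowels¹

At speaker²: 2, 3, 4
Comparisons, in the order extracted (only the first group is clearly tied to speaker 2; the assignment of the others to speakers 3 and 4 is unclear in the source):
  /æ/ > /i/, /æ/ > /u/
  /æ/ > /a/
  /a/ > /u/, /a/ > /i/, /a/ > /æ/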

Whispered vowels. The results of analysis of variance procedures for the whispered vowels demonstrated significant differences among speakers at all vowels but /u/. For three speakers, there were significant differences between the performances yielded by vowels. Table III summarizes the results of a posteriori comparisons among whispered vowels and indicates the same trends observed for differences among voiced vowels - namely, that low vowels, /a/ and /æ/, yield better identification scores than high vowels.

Filtered vowels. No significant differences among vowels were found for these stimuli (F(3,21) = 1.47, p [remainder of passage missing in source]

Confusion among speakers for voiced /a/

                   Perceived speaker
Actual speaker     1    2    3    4    5    6    7    8
      1           31    3    6    6    1   11    1    1
      2            3   18    4    2   19   10    0    4
      3            6    5   20    5    8   10    5    2
      4            3    1    5   32    0    1    1   17
      5            8    2    3    0   46    0    1    0
      6           10    7   11    0    5   24    1    2
      7            4    2    1    2    0    0   31    1
      8            4    2    1    4    6   16    1   26
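
Because the diagonal of a confusion matrix holds the correct responses, identification scores can be read directly from it. The Python sketch below does this for the voiced /a/ matrix above, taking rows as actual speakers and columns as perceived speakers. The values are transcribed as they appear in the extracted text, where rows 3 and 7 sum to 61 and 41 rather than 60 like the others, so the sketch divides by the actual row totals instead of assuming a fixed number of presentations. The function name is an assumption made for illustration.

```python
# Voiced /a/ confusion matrix as extracted: rows = actual speaker,
# columns = perceived speaker.
VOICED_A = [
    [31, 3, 6, 6, 1, 11, 1, 1],
    [3, 18, 4, 2, 19, 10, 0, 4],
    [6, 5, 20, 5, 8, 10, 5, 2],
    [3, 1, 5, 32, 0, 1, 1, 17],
    [8, 2, 3, 0, 46, 0, 1, 0],
    [10, 7, 11, 0, 5, 24, 1, 2],
    [4, 2, 1, 2, 0, 0, 31, 1],
    [4, 2, 1, 4, 6, 16, 1, 26],
]


def percent_correct(matrix):
    """Return (per-speaker percent correct, overall percent correct)."""
    correct = [row[i] for i, row in enumerate(matrix)]   # diagonal entries
    totals = [sum(row) for row in matrix]                # responses per actual speaker
    per_speaker = [100.0 * c / t for c, t in zip(correct, totals)]
    overall = 100.0 * sum(correct) / sum(totals)
    return per_speaker, overall


per_speaker, overall = percent_correct(VOICED_A)
for speaker, pct in enumerate(per_speaker, start=1):
    print(f"speaker {speaker}: {pct:.1f}% correct")
print(f"overall: {overall:.1f}% correct; chance with 8 speakers: {100 / 8:.1f}%")
```

The same function can be applied to any of the other matrices in this appendix, provided their rows are in actual-speaker order.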

Confusion among speakers for whispered /i/
[Rows are listed in the order extracted from the source; their assignment to actual speakers 1 to 8 is uncertain.]

Perceived speaker
    1    2    3    4    5    6    7    8
   15    9    8   11    3    4    6    4
    3    5   10    6   18    8    4    6
    6    0   13   10    9   11    4    7
   10    0   17    9    7   12    1    4
   12    7    3    7    7   11   12    1
    3    1    8    7   13   15    4    9
   11   23    7    8    3    3    2    3
    1    4   11   11    9   15    5    4

Confusion among speakers for whispered /u/
[Rows are listed in the order extracted from the source; their assignment to actual speakers 1 to 8 is uncertain.]

Perceived speaker
    1    2    3    4    5    6    7    8
   15    5    5    6    2    4   12   11
    9    6    5   15    5    4   10    6
    7    2    7   10    8   14    8    4
    2    0   14    3    7   15    8   11
   13    2   15    7    7    5    7    4
    3    3    7    5    9   13   11    9
    7   17   11    4    5    7    4    5
   11    4    7    3    1    8    9   17

Confusion among speakers for whispered /æ/

                   Perceived speaker
Actual speaker     1    2    3    4    5    6    7    8
      1           15    2   12    6    6   14    1    4
      2           10   41    0    3    1    0    1    4
      3           11    7    6    4    8   12    5    7
      4           11    2    5    6   18   10    5    3
      5           14   16    6    9    1    3    4    7
      6            8    5    3    3   13   17    4    7
      7            8   14    5    6    3    7   15    2
      8           12    4   18    7    6   12    1    0

Confusion among speakers for whispered /a/

                   Perceived speaker
Actual speaker     1    2    3    4    5    6    7    8
      1           10    3   12    9    9    7    3    7
      2           19   31    1    0    3    4    1    1
      3           11    2   19    2    4   12    5    5
      4            7    8    3   31    3    2    3    3
      5            9   10    8    4    5    6    8   10
      6            6    1   19    3    3   19    1    8
      7           10    2   13    4    7    6    8   10
      8            4    3   11    8    8   20    1    5

Confusion among speakers for filtered /i/
[Seven of the eight rows are legible; the row labels and one row are illegible in the source. Rows are listed in the order extracted. The character "i" in the second row is read as 1, and a leading 7 on the last legible row is treated as a row label.]

Perceived speaker
    1    2    3    4    5    6    7    8
   11    4    6   10    0   10    9   10
    4    8    6   15    1   12    8    6
    5   14    3    2   21    5    1    9
   11    4   10    7    2    5   11   10
    3   17    5    5    7    6    4   13
   12    6   10   12    0    8    7    5
   11    6    3    8    4   11   11    6

Confusion among speakers for filtered /u/
[Rows are listed in the order extracted from the source; their assignment to actual speakers 1 to 8 is uncertain.]

Perceived speaker
    1    2    3    4    5    6    7    8
    7    3    6    9    5    5   15   10
    7    5    8    9    6    4   16    5
    3   11    5    4   20    8    4    5
    9    9    5    7    0   13   15    2
    7    4   11    4    2   10   11   11
    6    9    7    3   11    5    6   13
    5    7   13    6    5   10    6    8
   12    9    7    3    5   10    9    5

Confusion among speakers for filtered /æ/
[Rows are listed in the order extracted from the source; their assignment to actual speakers 1 to 8 is uncertain.]

Perceived speaker
    1    2    3    4    5    6    7    8
    4    5    4   20    1    0   17    9
   10    8    3   11    3    8   13    4
    1   13    2    1   28    0    2   13
   11    5   11    7    1    6    8   11
    8    8    7    7    1    5   19    5
   17   13    6    8    0    5    3    8
    3   17    8    3   14    3    2   10
    6   10   10    5    6    9    3   11

Confusion among speakers for filtered /a/

                   Perceived speaker
Actual speaker     1    2    3    4    5    6    7    8
      1            7   13    2   11    1    3   19    4
      2            3   20    3    7    7    5    0   15
      3            8    6    9    3    8   12    3   11
      4           10    7    6   12    3    4   11    7
      5            4   14    3    3   24    2    0   10
      6           13    8   10    2    5    5    4   13
      7           10    6   10   12    2    4   12    4
      8            5   12   10    2    7    6    1   17

Request reprints from: Conrad LaRiviere, Speech Science Laboratory, University of Missouri-Kansas City, Kansas City, MO 64110 (USA)
