The Importance of Acoustic Temporal Fine Structure Cues in Different Spectral Regions for Mandarin Sentence Recognition

Bei Li,1 Hui Wang,1 Guang Yang,1 Limin Hou,2 Kaiming Su,1 Yanmei Feng,1 and Shankai Yin1

1Department of Otolaryngology, Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai, China; and 2School of Communication and Information Engineering, Shanghai University, Shanghai, China.

Objectives: To study the relative contribution of acoustic temporal fine structure (TFS) cues in low-, mid-, and high-frequency regions to Mandarin sentence recognition.

Design: Twenty-one subjects with normal hearing were involved in a study of Mandarin sentence recognition using acoustic TFS. The acoustic TFS information was extracted from ten bands, each 3 equivalent rectangular bandwidths (ERBs) wide, within the range 80 to 8858 Hz using the Hilbert transform and was assigned to low-, mid-, and high-frequency regions. Percent-correct recognition scores were obtained with acoustic TFS information presented using one, two, or three frequency regions. The relative weights of the three frequency regions were calculated using the least-squares approach.

Results: The mean percent-correct scores for sentence recognition using acoustic TFS were nearly perfect when all three frequency regions were presented together. Recognition was approximately 50 to 60% correct with only the low- or mid-frequency region but decreased to approximately 5% correct with only the high-frequency region of acoustic TFS. The mean weights of the low-, mid-, and high-frequency regions were 0.39, 0.48, and 0.13, respectively, and the difference between each pair of frequency regions was statistically significant.

Conclusion: The acoustic TFS cues in the low- and mid-frequency regions convey more information for Mandarin sentence recognition, whereas those in the high-frequency region have little effect.

Key words: Mandarin Chinese, Perceptual weight, Speech perception, Temporal fine structure.

(Ear & Hearing 2016;37;e52–e56)

INTRODUCTION

Speech may be viewed acoustically as the product of a temporal envelope (E) and a temporal fine structure (TFS) (Smith et al. 2002; Zeng et al. 2005). The acoustic E contains temporal modulation information, and the acoustic TFS contains the instantaneous phase information in the signal (Kong & Zeng 2006; Hopkins et al. 2008). Previous studies have assessed the role of the acoustic E and TFS in speech perception (Drullman 1995; Shannon et al. 1995; Xu et al. 2002; Xu & Pfingst 2003; Zeng et al. 2004; Moore 2008; Hopkins & Moore 2009). Overall, the results suggest that the E is most important for speech perception in quiet environments and that the acoustic TFS is critical for pitch and music perception and for speech perception in noisy environments (Smith et al. 2002; Lorenzi et al. 2006; Hopkins & Moore 2009). More recently, Apoux and Healy (2013) have suggested that TFS is primarily used as a grouping cue to select the target speech signal in noise.

Although the importance of TFS information has been established in these studies, the acoustic TFS does not appear to be functionally equivalent across different frequency regions. Most results suggest that the mid-frequency region is more important than the low- and high-frequency regions (Ardoint & Lorenzi 2010; Fogerty 2011a, 2011b). Specifically, Ardoint and Lorenzi (2010) found that the most important acoustic TFS cues for speech identification were at 1 to 2 kHz. Recognition of speech based on acoustic TFS cues decreased significantly when the stimuli were high-pass filtered above 2.5 kHz. Fogerty (2011a, 2011b) reported that the mid-frequency region (528 to 1941 Hz) was weighted most heavily for acoustic TFS cues in a quiet environment. Hopkins and Moore (2010) showed that acoustic TFS information was important across the frequency spectrum, but they did not calculate the relative importance of acoustic TFS cues across different frequency regions. The differences among studies may result from the methods used to evaluate the relative importance of various frequency regions and from the speech materials used in the studies.

Mandarin, which is spoken by more people than any other language in the world, is a tonal language. Words that have the same segmental phonetic content but differ in tone have different meanings. Previous studies have shown that lexical tones are distinguished primarily by a change in fundamental frequency (F0), although other acoustic cues, such as syllable duration and amplitude envelope, are also involved in tone recognition (Whalen & Xu 1992; Fu et al. 1998). For Mandarin speech, E cues covary with F0 (Kong & Zeng 2006). The role of E cues for tone recognition has been explored in previous studies (Whalen & Xu 1992; Lin 1988; Fu & Zeng 2000). However, the role of TFS cues seems particularly important for tone recognition. The relative contributions of E and TFS to Mandarin tone recognition were examined by Xu and Pfingst (2003) and later confirmed by Wang et al. (2011). The results showed that lexical tone recognition in normal-hearing, native Mandarin-speaking listeners was based on TFS regardless of E cues (Xu et al. 2002; Xu & Pfingst 2003; Xu & Zhou 2011). Other studies showed that pitch perception was dominated by TFS rather than by E cues and that TFS is critical for voice pitch perception (Zeng et al. 2004; Kong & Zeng 2006) and tone recognition in noise (Xing et al. 2012).

However, the question of which frequency regions are the most important with regard to TFS cues for speech perception in tonal languages remains unresolved. In view of the importance of TFS for pitch perception and tone recognition, which is mainly dependent on low-frequency information, it is expected that the acoustic TFS information from the low-frequency region would be especially important for tonal languages, more than for nontonal languages. To test this hypothesis, acoustic TFS was extracted and assigned to low-, mid-, and high-frequency regions, allowing the relative importance of acoustic TFS from different frequency regions to be evaluated for Mandarin sentence recognition.
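The view of a signal as the product of an envelope (E) and a temporal fine structure (TFS) can be illustrated with the Hilbert analytic signal. The following is a minimal Python sketch, not the authors' MATLAB code: the function names `analytic_signal` and `envelope_and_tfs` are hypothetical, the DFT is a naive one chosen only to keep the example dependency-free, and the band-pass filtering that would precede this step for each analysis band is omitted.

```python
import cmath
import math


def analytic_signal(x):
    """Return the analytic signal of a real sequence x.

    Uses a naive O(n^2) DFT (illustrative only): negative-frequency bins
    are zeroed and positive-frequency bins doubled before inverting.
    """
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    gain = [0.0] * n
    gain[0] = 1.0                      # keep the DC bin unchanged
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist bin unchanged
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for k in positive:
        gain[k] = 2.0                  # double the positive frequencies
    return [
        sum(spectrum[k] * gain[k] * cmath.exp(2j * math.pi * k * t / n)
            for k in range(n)) / n
        for t in range(n)
    ]


def envelope_and_tfs(x):
    """Split x into envelope E(t) = |z(t)| and TFS(t) = cos(arg z(t))."""
    z = analytic_signal(x)
    envelope = [abs(v) for v in z]
    tfs = [math.cos(cmath.phase(v)) for v in z]
    return envelope, tfs
```

By construction, multiplying the two outputs sample by sample gives back the input, which is the "product" relationship described in the text. A TFS-only stimulus would keep `tfs` for each band and discard `envelope`.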

0196/0202/2016/371-0e52/0 • Ear & Hearing • Copyright © 2015 Wolters Kluwer Health, Inc. All rights reserved • Printed in the U.S.A. e52





LI ET AL. / EAR & HEARING, VOL. 37, NO. 1, e52–e56

TABLE 1. Cutoff frequencies for each band and frequency region

Frequency regions: low-frequency region, mid-frequency region, high-frequency region

Band       Cutoff Frequency (Hz)
Band 1     80–205
Band 2     205–372
Band 3     372–596
Band 4     596–899
Band 5     899–1315
Band 6     1315–1893
Band 7     1893–2716
Band 8     2716–3924
Band 9     3924–5782
Band 10    5782–8858
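The band edges in Table 1 can be approximated by stepping in equal 3-ERB increments from 80 Hz. The sketch below is an illustration under an assumption: it uses the ERB-number formula commonly attributed to Moore and Glasberg (1983), which this text does not name. The function names and the `TABLE_1_EDGES_HZ` list are introduced here only for the example. Under that assumption the computed edges fall within about 1% of the tabulated cutoffs.

```python
import math


def erb_number(f_hz):
    """ERB-number for a frequency in Hz (assumed formula, f in kHz inside)."""
    f = f_hz / 1000.0
    return 11.17 * math.log((f + 0.312) / (f + 14.675)) + 43.0


def erb_number_to_hz(e):
    """Inverse of erb_number: frequency in Hz for a given ERB-number."""
    r = math.exp((e - 43.0) / 11.17)
    return 1000.0 * (14.675 * r - 0.312) / (1.0 - r)


def band_edges(f_low_hz=80.0, n_bands=10, width_erb=3.0):
    """Edges of n_bands contiguous bands, each width_erb ERBs wide."""
    start = erb_number(f_low_hz)
    return [erb_number_to_hz(start + i * width_erb) for i in range(n_bands + 1)]


# Cutoff frequencies copied from Table 1 for comparison.
TABLE_1_EDGES_HZ = [80, 205, 372, 596, 899, 1315, 1893, 2716, 3924, 5782, 8858]

if __name__ == "__main__":
    for computed, tabulated in zip(band_edges(), TABLE_1_EDGES_HZ):
        print(f"{computed:8.1f} Hz (computed)   {tabulated:5d} Hz (Table 1)")
```

Other ERB formulas give somewhat different edges, so the close agreement here supports the assumed formula but does not confirm which one the authors used.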

three frequency regions (one condition listed in Table 2) were presented to subjects. Table 2 lists all of the conditions tested. All programs were run using MATLAB (version 7.0) software.
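The abstract states that relative weights for the three regions were obtained with a least-squares approach from scores measured with one, two, or three regions present. One common way to set this up is sketched below in Python. It is an assumption-laden illustration, not the authors' procedure: the design is taken to be every non-empty combination of regions (seven conditions, which may not match Table 2), the model has no intercept, and all names are hypothetical.

```python
from itertools import combinations

REGIONS = ("low", "mid", "high")


def all_conditions():
    """Every non-empty subset of REGIONS as a 0/1 presence row (7 rows)."""
    rows = []
    for k in range(1, len(REGIONS) + 1):
        for subset in combinations(range(len(REGIONS)), k):
            rows.append([1.0 if i in subset else 0.0
                         for i in range(len(REGIONS))])
    return rows


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def relative_weights(design, scores):
    """Least-squares fit of scores on region presence, normalized to sum to 1."""
    n = len(design[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(n)]
           for i in range(n)]
    xty = [sum(row[i] * s for row, s in zip(design, scores)) for i in range(n)]
    raw = solve_linear(xtx, xty)
    total = sum(raw)
    return [w / total for w in raw]
```

Given a list of proportion-correct scores in the same order as `all_conditions()`, `relative_weights` returns one weight per region. Any scores passed in for experimentation should be treated as synthetic, since the per-condition data are not reproduced in this text.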

Procedures

MATERIALS AND METHODS

Subjects

Twenty-one listeners with normal audiometric thresholds (
