Daniel Fogerty and Jenine L. Entwistle: JASA Express Letters [http://dx.doi.org/10.1121/1.4935079] Published Online 6 November 2015

Level considerations for chimeric processing: Temporal envelope and fine structure contributions to speech intelligibility

Daniel Fogerty(a) and Jenine L. Entwistle
Department of Communication Sciences and Disorders, University of South Carolina, Columbia, South Carolina 29208, USA
[email protected], [email protected]

Abstract: Chimeric processing is used to assess the respective roles of the acoustic temporal envelope (ENV) and the temporal fine structure (TFS) by adding noise to either component. An acoustic analysis demonstrates that adding noise to the ENV results in noise degradation of the ENV and overall signal attenuation, whereas adding noise to the TFS results in only noise degradation of the TFS. Young normal-hearing adults were then tested using a modified chimeric strategy that maintains speech levels. Results partially confirm the primary role of the ENV in determining speech intelligibility but demonstrate significant TFS contributions during selective ENV masking. © 2015 Acoustical Society of America

Date Received: June 17, 2015

Date Accepted: October 22, 2015

1. Introduction

Research has established the primary importance of the acoustic temporal envelope (ENV) for speech intelligibility (e.g., Shannon et al., 1995; Fogerty, 2011). Several studies have also shown that speech intelligibility is greatly reduced when ENV cues are limited, such as through interference from fluctuating maskers (e.g., Nelson et al., 2003). In such conditions, it has been proposed that the frequency-modulated carrier of the signal, the acoustic temporal fine structure (TFS), may provide additional information for speech intelligibility by facilitating the glimpsing of spectro-temporal moments at favorable signal-to-noise ratios (SNRs), that is, by listening in the amplitude dips of the interfering masker (Lorenzi et al., 2006; Hopkins and Moore, 2009). Recent investigations have suggested that the availability of the speech TFS during these moments of favorable SNR may be particularly important for source segregation (Apoux et al., 2013). TFS may also be important in some intelligibility metrics (Chen et al., 2013) and may play an important role in the integration of speech fragments (Gilbert and Lorenzi, 2010; Gnansia et al., 2010). Indeed, listeners do appear to place significant perceptual weight on TFS cues when the ENV is significantly degraded by noise (Fogerty, 2011). TFS may therefore play an important role in noisy listening contexts; however, the role and extent of this contribution are still debated.

Recently, Apoux et al. (2013) presented a very instructive framework for viewing the role and contribution of the TFS to speech intelligibility, particularly in marrying the findings of previous vocoder- and chimeric-based signal processing strategies. Vocoder-based methods extract the ENV through rectification and low-pass filtering and then modulate an independent noise carrier.
However, the noise carrier is typically used both for noise vocoding the target and for masking, resulting in a sound segregation problem independent of the listener's ability to process TFS cues (Apoux et al., 2013). Chimeric-based methods use the Hilbert transform to extract ENV and TFS components from clean (i.e., quiet) and noisy samples of the same speech signal. The signal processing employed by these studies is crucial to understanding the potential role of the TFS. The present study was designed as a further analysis of this signal processing and its implications for chimeric-based processing strategies. The goals of this study are therefore to (1) determine the acoustic and perceptual implications of chimeric processing strategies and possible refinements and (2) discuss the potential implications of this signal processing.

2. Acoustic analysis of chimeric processing using the same speech sample

Chimeric processing was first proposed by Smith et al. (2002), using the Hilbert transform to combine the ENV from one speech sample with the TFS from a second, independent speech sample. As such, the combined stimulus presents conflicting speech cues to the listener. A modification of this method uses the Hilbert transform to extract the ENV from the speech signal at one SNR and the TFS from the same speech signal at a different SNR. These components are then combined to form the final stimulus, which independently degrades the ENV and TFS components of the original speech signal. Importantly, this modified processing method has the potential to restore the full speech stimulus with noise differentially added to either component. This chimeric processing strategy was initially implemented by Fogerty (2011) [Smith et al. (2002) used ENV and TFS components from different signals] and offers several significant advantages over vocoder-based techniques in applications investigating the relative contribution of ENV and TFS cues: (1) ENV and TFS components can be independently degraded; (2) components are degraded according to the SNR at which the noise is added to the original speech stimulus (i.e., the SNR is an ecologically meaningful scale of stimulus degradation); (3) the two original carriers are preserved for the noise and speech signals, allowing TFS cues to promote segregation between competing sound sources; and (4) the full ENV remains preserved but is masked by noise, reducing potential contributions from ENV recovery based on the acoustic TFS (Apoux et al., 2013).

The signal processing implemented by Apoux et al. (2013) and Fogerty (2011) differs in two important ways. The first difference, not investigated here, was the number of analysis bands. The second difference, under investigation in the present study, was that Apoux et al. (2013) normalized the overall output of the combined speech and masker mixture to 65 dBA in order to allow presentation of extreme SNR values (e.g., −1000 dB SNR).

a) Author to whom correspondence should be addressed.
In contrast, Fogerty (2011) ensured that stimuli were calibrated to present the processed speech at the original speech level. This difference could have significant implications for the interpretation of the results because adding noise to only the ENV results in a greater change in the combined mixture intensity than adding noise to only the TFS component. Thus, equating the combined speech and masker mixture levels results in an overall signal attenuation that is fundamentally different for ENV-masked compared to TFS-masked stimuli. This correlated change in signal level may have important consequences when extending results to listeners with reduced audibility (i.e., hearing loss). To demonstrate this effect, the stimuli used by Apoux et al. (2013) were recreated and the signal attenuation factors were calculated. The purpose of this analysis is to highlight potential level effects for chimeric-processed speech where ENV and TFS components are extracted from the noisy signal.

2.1 Methods

Fifty sentences from the Speech Perception in Noise (SPIN; Kalikow et al., 1977) test, including equal numbers of high- and low-predictability sentences, were equated for overall root-mean-square (rms) level and selected for this analysis. A simplified speech-spectrum-shaped noise was created, modeled after the one used by Apoux et al. (2013) (constant spectrum level below 800 Hz with a 6 dB/octave roll-off above 800 Hz). Sentences were processed using the chimeric paradigm as implemented by Apoux et al. (2013). The masker and target mixture was filtered into 30 contiguous frequency bands, each one equivalent rectangular bandwidth wide (ERBN; Glasberg and Moore, 1990). The ENV and TFS were extracted from each band using the Hilbert transform. The ENV extracted from each band of the mixture at one SNR was then combined with the TFS extracted from the same band of the mixture at a different SNR. The processed bands were then summed across frequency to form the final stimulus.
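As a rough illustration of the Hilbert decomposition at the core of this paradigm, the sketch below extracts the ENV and TFS of a signal and recombines them. It is a single-band simplification (the actual processing uses 30 ERB-wide bands applied to speech-plus-noise mixtures at SNRenv and SNRtfs), with the Hilbert transform computed directly via the FFT; all names are illustrative, not from the original study.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (same result as scipy.signal.hilbert)."""
    n = len(x)
    spec = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spec * gain)

def env_tfs(x):
    """Split a band-limited signal into its Hilbert envelope (ENV) and
    unit-amplitude temporal fine structure (TFS)."""
    z = analytic_signal(x)
    return np.abs(z), np.cos(np.angle(z))

def chimera(env_source, tfs_source):
    """Combine the ENV of one signal with the TFS of another."""
    env, _ = env_tfs(env_source)
    _, tfs = env_tfs(tfs_source)
    return env * tfs

# Within a band, ENV * TFS restores the original signal:
fs = 16000
t = np.arange(fs) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
print(np.max(np.abs(chimera(x, x) - x)))  # near machine precision
```

In the actual paradigm, `env_source` would be the band-filtered speech-plus-noise mixture at SNRenv and `tfs_source` the same band mixed at SNRtfs, with the outputs summed across the 30 bands.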
The combined stimulus resulted in restoration of the full speech signal with the ENV masked at one SNR (SNRenv) and the TFS masked at a different SNR by the same noise source (SNRtfs). The rms level of the combined stimulus was then compared to the rms level of the original sentence prior to noise masking. The difference between these two measures defines the amount of attenuation applied by the signal processing as a result of the final scaling of the combined stimulus. This attenuation is in contrast to the processing used by Fogerty (2011), who ensured that the original speech level was maintained for all conditions, with the level of the added noise allowed to vary.

2.2 Results and discussion

Figure 1(a) displays the resulting signal attenuation values for the different conditions tested by Apoux et al. (2013) using speech-spectrum-shaped noise. First, it is notable that the attenuation level is largely dependent upon the ENV noise level (r = 0.95, p < 0.001). The attenuation level increases as the SNRenv decreases. The consequence of this processing is attenuation of both the ENV and TFS components.
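The level consequence of this scaling can be demonstrated with a toy computation. The sketch below is a broadband simplification (white noise stands in for both the speech token and the speech-shaped masker, and the per-band chimeric processing is omitted; all names are illustrative): it mixes the two signals at a given SNR, renormalizes the mixture back to the original speech rms, and reports how much the speech itself is attenuated.

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def normalization_attenuation_db(speech, noise, snr_db):
    """Speech attenuation (dB) caused by normalizing the speech+noise
    mixture back to the original speech level; positive = attenuated."""
    noise = noise * (rms(speech) / rms(noise)) * 10 ** (-snr_db / 20)
    mixture = speech + noise
    gain = rms(speech) / rms(mixture)  # normalize mixture to speech level
    return -20 * np.log10(gain)

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for a speech token
noise = rng.standard_normal(16000)   # stand-in for speech-shaped noise

for snr_db in (6, 0, -6, -12):
    print(f"SNR {snr_db:+3d} dB: speech attenuated by "
          f"{normalization_attenuation_db(speech, noise, snr_db):.2f} dB")
```

For independent signals this attenuation is approximately 10·log10(1 + 10^(−SNR/10)) dB, i.e., about 1 dB at +6 dB SNR but about 7 dB at −6 dB SNR, mirroring the SNRenv dependence shown in Fig. 1(a).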


Fig. 1. (Color online) (a) Level of the processed stimuli compared to the level of the original sentence before adding the noise. Values along the ordinate indicate the amount of attenuation applied to the final stimulus by Apoux et al. (2013) as a function of the SNR of the ENV (left panel) and as a function of the SNR of the TFS (right panel) with SNRtfs and SNRenv as the parameter, respectively. Results strongly demonstrate that SNRenv determines the final signal attenuation. (b) Performance is plotted across SNR. Solid lines indicate processing conditions that maintained the same speech level, the dotted line (red) is the corresponding reference function (REF) where ENV and TFS were equally masked at the same SNR, and the dashed lines (blue) indicate performance using stimuli that equated overall presentation level in which the speech level is attenuated (from Apoux et al., 2013).

In contrast, the TFS noise level was not associated with the amount of attenuation (r = 0.008). To assess the relative contributions of SNRenv, SNRtfs, and attenuation, these factors were entered as predictors in a step-wise linear regression analysis with listener performance as the dependent variable. Mean listener performance was first estimated from Fig. 3 of Apoux et al. (2013) and converted into rationalized arcsine units (RAU; Studebaker, 1985). Results demonstrated that the signal attenuation value best predicted mean listener performance, accounting for 76% of the variance [F(1,49) = 155.1, p < 0.001, β = 0.87], with SNRtfs accounting for an additional 12% of the variance [F(2,49) = 179.0, p < 0.001, β = 0.35]. These results indicate that while ENV processing accounted for the primary variance across conditions, availability of the TFS still accounted for significant additional variance. Importantly, the variance accounted for by the TFS was independent of signal attenuation and not accounted for by ENV reconstruction; it was in addition to any contributions from the ENV. Given that the TFS signal was also attenuated based on the ENV processing, it is possible that TFS may play an even greater role in stimuli that maintain the speech level (i.e., preserving the original TFS level). The purpose of experiment 1 was to behaviorally examine performance when the speech level is maintained.

3. Performance with chimeric processing maintaining speech level

3.1 Methods

3.1.1 Listeners

Twenty young normal-hearing listeners (18-26 yr) were paid an hourly rate or received course credit for participation. All participants had pure-tone air-conduction thresholds of 20 dB hearing level (HL) or better at octave frequencies from 250 to 8000 Hz. One participant was removed from analysis after it was discovered that she had previously participated in the Apoux et al. (2013) study.

3.1.2 Stimuli and design

The target stimuli consisted of 187 SPIN sentences.
Each sentence contained a unique final keyword, without repetition. The sentences were sorted into 11 experimental blocks of 16 SPIN sentences. Each block contained eight low-predictability sentences and eight high-predictability sentences. All stimuli were processed according to the method defined in Sec. 2.1 that preserved speech levels. This ensured that the reconstructed speech level was maintained at 70 dB sound pressure level (SPL), with additional energy supplied by the noise component added during processing. Processed stimuli were created using the chimeric processing method with unique SNRenv and SNRtfs levels to investigate the relative contributions of the TFS and ENV. SNRenv was examined at two levels (+6 or −6 dB SNR) with SNRtfs set at +6, 0, or −6 dB SNR. Likewise, SNRtfs was set to +6 or −6 dB SNR with SNRenv set to +6, 0, or −6 dB SNR. The overlapping conditions (when SNRenv and SNRtfs were equal at +6 or −6 dB SNR)


were only tested once, resulting in a total of ten conditions. As a reference, an 11th condition tested SNRenv and SNRtfs both set to 0 dB SNR. These three conditions in which SNRenv and SNRtfs were equal are functionally similar to adding speech to a concurrently presented speech-shaped noise. This experimental design replicates critical conditions from Apoux et al. (2013).

3.1.3 Procedure

Participants were tested in a sound-attenuating booth for all experimental conditions. Stimuli were presented monaurally to the right ear via a Sennheiser HD 280 Pro headphone. Participants were instructed to type the final keyword heard in each sentence. A short demo of 11 sentences, corresponding to the experimental test conditions, was provided. The demo sentences were not used during formal testing, and no keywords were repeated within or across conditions. After practice, each participant completed the eleven experimental blocks presented in random order. Participants heard each sentence only once.

3.2 Results and discussion

Performance in this experiment is displayed in Fig. 1(b) as the solid lines (for different SNRenv and SNRtfs values) and dotted line (for equal SNRenv and SNRtfs values). The grey dashed lines provide the corresponding data from Apoux et al. (2013) for comparison. A repeated-measures analysis of variance (ANOVA) was performed to analyze the two levels of SNRenv (+6 and −6 dB) at three levels of SNRtfs (+6, 0, and −6 dB). This analysis demonstrated a main effect of SNRenv [F(1,19) = 530.191, p < 0.05]. As expected, participants did better when the ENV was preserved at +6 dB. The analysis also demonstrated a main effect of SNRtfs across all three levels [F(2,38) = 131.026, p < 0.05]. Overall, better performance was observed when the TFS was more preserved, at +6 dB SNR. There was a significant interaction between SNRenv and SNRtfs [F(2,38) = 84.925, p < 0.05].
When the ENV was preserved at +6 dB, participants performed similarly regardless of the preservation of the TFS (p > 0.05). However, when the ENV was degraded at −6 dB, there was a significant difference across levels of SNRtfs. That is, for this unfavorable SNRenv, listeners did systematically better with increasing SNRtfs [−6 vs 0 dB: t(19) = 11.658, p < 0.0125, d = 0.88; 0 vs +6 dB: t(19) = 5.532, p < 0.0125, d = 0.54]. A second two (SNRtfs: +6 and −6 dB) by three (SNRenv: +6, 0, and −6 dB) repeated-measures ANOVA demonstrated significant main effects of SNRtfs [F(1,19) = 96.274, p < 0.05] and SNRenv [F(2,38) = 346.282, p < 0.05]. Overall, participants performed better when the TFS was more preserved, at +6 dB. Participants also scored higher keyword accuracy as the SNRenv became increasingly preserved (from −6 to +6 dB). Importantly, the analysis also demonstrated a significant interaction between SNRtfs and SNRenv [F(2,38) = 38.474, p < 0.05]. Participants did not perform significantly differently at 0 and +6 dB SNRenv for either level of SNRtfs (+6 and −6 dB), Bonferroni-adjusted p > 0.05. However, a significant difference in performance was noted between the two SNRtfs levels at −6 dB SNRenv [t(19) = 8.743, p < 0.05, d = 0.941] in Fig. 1(b), which supports a greater contribution of the TFS when the ENV is degraded.

Overall, results show that listeners primarily utilize ENV cues when these are preserved at a favorable SNR. Under these conditions, listener performance does not appear to be significantly impacted by the degree of TFS preservation. However, performance is significantly related to the degree of TFS preservation when ENV cues are degraded. These results for speech-level-controlled stimuli are consistent with the previous data of Apoux et al. (2013), which implemented final stimulus scaling based on the combined speech and noise levels.
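The pairwise comparisons above use paired t-tests with a Bonferroni-adjusted criterion (the p < 0.0125 threshold is consistent with alpha = 0.05 divided across four comparisons, an assumption on our part). A minimal numpy-only sketch of the statistic, with synthetic illustrative scores:

```python
import numpy as np

def paired_t(a, b):
    """Paired-samples t statistic and degrees of freedom."""
    d = np.asarray(a, float) - np.asarray(b, float)
    n = len(d)
    t = np.mean(d) / (np.std(d, ddof=1) / np.sqrt(n))
    return t, n - 1

# Synthetic RAU scores for 20 listeners in two conditions (illustrative only)
rng = np.random.default_rng(1)
low_tfs = rng.normal(40, 8, 20)                 # e.g., SNRtfs = -6 dB
high_tfs = low_tfs + rng.normal(10, 4, 20)      # e.g., SNRtfs = +6 dB
t, df = paired_t(high_tfs, low_tfs)
alpha = 0.05 / 4  # assumed Bonferroni adjustment across four comparisons
print(t, df, alpha)
```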
The main difference in our findings is that our listeners performed, on average, 15 percentage points better than those in the previous study, possibly due to the higher average speech level in our study, which was independent of noise level across conditions. However, the significant observation from these data and those of Apoux et al. (2013) is the interaction between SNRenv and SNRtfs, whereby the TFS contributes when ENV cues are limited. This is consistent with Apoux et al. (2013) reporting large effect sizes for the TFS, but only in those restricted cases where the ENV was presented at SNRs below 0 dB.

4. General discussion

The acoustic analysis presented here demonstrates that normalizing the combined signal level of the speech and noise mixture introduces speech level attenuation that is correlated with the preservation of the ENV (SNRenv). Testing with young normal-hearing adults using processing that maintained original speech levels demonstrates higher overall performance across most conditions compared to using normalized combined signal levels. However, similar performance trends across the SNRs tested were observed. This may indicate a relatively minor role of attenuation in determining relative ENV and TFS contributions. Nevertheless, correlated signal attenuation may have significant consequences when extending investigations to listeners who have hearing loss, where audibility of the signal could interact with the attenuation. When processing steps are specifically taken to ensure that the speech is presented at the same level, the method can be extended to older listeners with hearing loss using controls for audibility (Fogerty and Humes, 2012).

Overall level normalization produces differential effects for ENV and TFS processing. Adding noise to the ENV results in two distinct signal distortions: noise degradation of the ENV and overall signal attenuation. In contrast, adding noise to the TFS results in only one type of distortion: noise degradation of the TFS. Overall, the behavioral results suggest that when contributions from the ENV are minimized through noise masking, listener performance is determined by the availability of TFS cues. At a more favorable SNRenv, listener performance is more dependent on the ENV, with little perceptual impact of the SNRtfs level. For high levels of ENV distortion, an improvement of more than 20 percentage points can be observed when the TFS is preserved (i.e., greater than 0 dB SNR). These results indicate that preservation of the TFS makes significant contributions to speech intelligibility when access to the ENV is limited by noise distortion. Exactly how the TFS contributes to intelligibility in these cases is not clear.
One possible candidate is that the time-varying frequency and harmonic cues provided by the TFS facilitate perceptual segregation of the speech ENV from the competing modulations of the masking noise, thereby enabling listeners to better encode the ENV information present in the signal. However, further investigations will need to examine the underlying source of the TFS contribution. These results are highly consistent with current theories of ENV and TFS processing. Temporal ENV cues are of primary importance. However, in everyday listening, fluctuating maskers frequently interfere with processing of speech ENV modulations (e.g., Nelson et al., 2003), and this may extend to the random amplitude modulations of steady-state noise, which can likewise impose modulation masking (Stone et al., 2012). Under such conditions of limited ENV resolution, listeners may rely significantly on TFS cues to facilitate segregation of the target ENV from competing amplitude-modulated signals.

Acknowledgments

This work was supported, in part, by National Institutes of Health Grant No. NIDCD R03-DC012506.

References and links

Apoux, F., Yoho, S. E., Youngdahl, C. L., and Healy, E. W. (2013). “Role and relative contribution of temporal envelope and fine structure cues in sentence recognition by normal-hearing listeners,” J. Acoust. Soc. Am. 134, 2205–2212.
Chen, F., Wong, L. N., and Hu, Y. (2013). “A Hilbert-fine-structure-derived physical metric for predicting the intelligibility of noise-distorted and noise-suppressed speech,” Speech Commun. 55, 1011–1020.
Fogerty, D. (2011). “Perceptual weighting of individual and concurrent cues for sentence intelligibility: Frequency, envelope, and fine structure,” J. Acoust. Soc. Am. 129, 977–988.
Fogerty, D., and Humes, L. E. (2012). “A correlational method to concurrently measure envelope and temporal fine structure weights: Effects of age, cochlear pathology, and spectral shaping,” J. Acoust. Soc. Am. 132, 1679–1689.
Gilbert, G., and Lorenzi, C. (2010). “Role of spectral and temporal cues in restoring missing speech information,” J. Acoust. Soc. Am. 128, EL294–EL299.
Glasberg, B. R., and Moore, B. C. J. (1990). “Derivation of auditory filter shapes from notched-noise data,” Hear. Res. 47, 103–138.
Gnansia, D., Pressnitzer, D., Péan, V., Meyer, B., and Lorenzi, C. (2010). “Intelligibility of interrupted and interleaved speech in normal-hearing listeners and cochlear implantees,” Hear. Res. 265, 46–53.
Hopkins, K., and Moore, B. C. J. (2009). “The contribution of temporal fine structure to the intelligibility of speech in steady and modulated noise,” J. Acoust. Soc. Am. 125, 442–446.
Kalikow, D. N., Stevens, K. N., and Elliott, L. L. (1977). “Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability,” J. Acoust. Soc. Am. 61, 1337–1351.
Lorenzi, C., Gilbert, G., Carn, H., Garnier, S., and Moore, B. C. J. (2006). “Speech perception problems of the hearing impaired reflect inability to use temporal fine structure,” Proc. Natl. Acad. Sci. U.S.A. 103, 18866–18869.
Nelson, P. B., Jin, S.-H., Carney, A. E., and Nelson, D. A. (2003). “Understanding speech in modulated interference: Cochlear implant users and normal-hearing listeners,” J. Acoust. Soc. Am. 113, 961–968.


Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270, 303–304.
Smith, Z. M., Delgutte, B., and Oxenham, A. J. (2002). “Chimaeric sounds reveal dichotomies in auditory perception,” Nature 416, 87–90.
Stone, M. A., Füllgrabe, C., and Moore, B. C. J. (2012). “Notionally steady background noise acts primarily as a modulation masker of speech,” J. Acoust. Soc. Am. 132, 317–326.
Studebaker, G. A. (1985). “A rationalized arcsine transform,” J. Speech Lang. Hear. Res. 28, 455–462.
