Journal of Speech and Hearing Research, Volume 33, 298-306, June 1990

ACOUSTIC CORRELATES OF VOCAL QUALITY

L. ESKENAZI Department of Electrical Engineering, University of Florida

D. G. CHILDERS Thomson CSF, Malakoff, France

D. M. HICKS Communicative Disorders, The Cleveland Clinical Foundation

We have investigated the relationship between various voice qualities and several acoustic measures made from the vowel /i/ phonated by subjects with normal voices and patients with vocal disorders. Among the patients (pathological voices), five qualities were investigated: overall severity, hoarseness, breathiness, roughness, and vocal fry. Six acoustic measures were examined. With one exception, all measures were extracted from the residue signal obtained by inverse filtering the speech signal using the linear predictive coding (LPC) technique. A formal listening test was implemented to rate each pathological voice for each vocal quality. A formal listening test also rated overall excellence of the normal voices. A scale of 1-7 was used. Multiple linear regression analysis between the results of the listening test and the various acoustic measures was used, with the prediction sum of squares (PRESS) as the selection criterion. Useful prediction equations of order two or less were obtained relating certain acoustic measures and the ratings of pathological voices for each of the five qualities. The two most useful parameters for predicting vocal quality were the Pitch Amplitude (PA) and the Harmonics-to-Noise Ratio (HNR). No acoustic measure could rank the normal voices.

KEY WORDS: vocal quality, vocal disorders, acoustic measures, residue signal

© 1990, American Speech-Language-Hearing Association 0022-4685/90/3302-0298$01.00/0

Downloaded From: http://jslhr.pubs.asha.org/ by a La Trobe Univ User on 02/07/2016 Terms of Use: http://pubs.asha.org/ss/rights_and_permissions.aspx

Analytic methods and commercial devices exist for measuring certain vocal characteristics, for example, the common measures of fundamental frequency of voicing (Fo), intensity, flow rate and flow volume, and the less common measures of Fo jitter perturbations (Davis, 1976; Hillenbrand, 1987; Hollien, Michel, & Doherty, 1973; Horii, 1979; Klinholz & Martin, 1985; Koike & Markel, 1975; Koike, Takahashi, & Calcaterra, 1977; Lieberman, 1961, 1963; Milenkovic, 1987; Smith, Weinberg, Feth, & Horii, 1978; Sorensen & Horii, 1984; Wendahl, 1966), shimmer perturbations (Horii, 1980; Kasuya, Ogawa, Kikushi, & Ebihara, 1986; Klinholz & Martin, 1985; Ludlow, Bassich, Connor, Coulter, & Lee, 1987; Sorensen & Horii, 1984), spectral noise levels and harmonics-to-noise ratio (Kasuya, Masubuchi, Ebihara, & Yoshida, 1986; Kitajima, 1981; Yumoto, Gould, & Baer, 1982; Yumoto, Sasaki, & Okamura, 1984), breathiness (Fukazawa, El-Assuoofy, & Honjo, 1988), long term spectral average (Hammarberg, Fritzell, Gauffin, Sundberg, & Wedin, 1980), and others. From these measures normative ranges for Fo and several other measures have been established. However, acoustic profiles of vocal quality are lacking, despite the fact that speech quality has been the focus of considerable research (Childers & Wu, 1990; Colton & Estill, 1981; Eskenazi, 1988; Klatt, 1987; Laver, 1980; Murry & Singh, 1980; Pinto, Childers, & Lalwani, 1989; Quackenbush, Barnwell, & Clements, 1988; Singh & Murry, 1978).

Researchers have also attempted to determine those vocal qualities most common and significant. To achieve this goal various quality scales have been used, such as GRBAS in Japan (Imaizumi, 1986) and the diagnostic acceptability measure (Voiers, 1977). Numerous studies have examined the possibility of obtaining quantitative information concerning vocal fold function and voice quality; however, the performance of the various measures considered has been found to depend on the type of pathology, the speaker's gender, and the condition of analysis (Kasuya, Kobayashi, & Kobayashi, 1983; Kasuya, Masubuchi, Ebihara, & Yoshida, 1986; Kasuya, Ogawa, Kikushi, & Ebihara, 1986; Muta, Muraoka, Wagatsuma, Fukuda, Takayama, Fujioka, & Kanou, 1987). Although a single measure for assessing vocal quality may be ineffective, a combination of measures may prove satisfactory (Prosek, Montgomery, & Hawkins, 1987; Sansone & Emanuel, 1970; Yumoto, Gould, & Baer, 1982).

One indirect procedure often used for studying vocal fold vibratory characteristics and their relationship to vocal quality is inverse filtering to obtain an estimate of the volume-velocity waveform (Hillman & Weinberg, 1981; Javkin, Antonanzas-Barroso, & Maddieson, 1987; Koike & Markel, 1975; Markel & Gray, 1976; Rothenberg, 1973; Rothenberg & Zahorian, 1977). Several disadvantages of using this approach are: (a) it is difficult to determine the parameters for the inverse filter model from the speech signal, (b) the volume-velocity waveform obtained by inverse filtering contains only low-frequency information, and (c) the vocal tract is assumed to be a linear model, which may not always be valid for large glottal openings (Koike & Markel, 1975; Rothenberg & Zahorian, 1977). For these reasons researchers have considered alternative procedures, for example the error or residue signal at the output of the inverse filter (Davis, 1976; Prosek et al., 1987). The residue signal has a broad frequency spectrum, and for speakers with normal voices the residue signal typically contains clear spikes at glottal closure (Markel & Gray, 1976). Speakers with pathological voices often have incomplete glottal closure contributing to a less distinct pattern of periodic spikes (Koike & Markel, 1975; Prosek et al., 1987). Figures 1 and 2 show illustrative examples of the speech and residue signals for a normal and a pathological speaker, respectively.

We decided to use the residue signal in our studies because it required less interactive judgment on the part of the investigator, contained more high-frequency spectral information than the glottal volume-velocity waveform, provided information about the timing of vocal fold vibratory events, and removed the effects of the vocal tract.

Most of the previous studies have attempted to identify acoustic features that could distinguish a normal voice from a voice with a vocal disorder. As in the study by Prosek et al. (1987), we wanted to examine the possibility of using features of the residue signal to predict vocal quality, including vocal disorders. A listening test was conducted to confirm the usefulness of one or more of these acoustic measures for predicting and objectifying the assessment of vocal quality. The listening test was not conducted to justify the machine classification of vocal quality; rather, it was to serve as a method for validating the possible clinical utility of any of the acoustic measures we assessed. Thus, in brief, the purpose of our study was to relate the parameters of the residue signal obtained by analyzing the speech signal to the perceived vocal qualities judged by trained listeners. The acoustic parameters were evaluated as possible predictors of vocal quality using multiple linear regression analysis.
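
The residue signal discussed above is the output of an LPC inverse filter. A minimal sketch of how such a residue might be computed for one frame is given below, assuming the autocorrelation method with the Levinson-Durbin recursion and a Hamming window; the paper does not state which LPC formulation or window was used, so both are assumptions, and the function names are hypothetical. The default order of 14 is the filter order the paper reports under its conditions of analysis.

```python
import math

def autocorrelation(x, max_lag):
    """Biased autocorrelation r[k] = sum_n x[n] * x[n + k], k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations for A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (e.g. digital silence): stop early
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + reflection * prev[m - k]
        a[m] = reflection
        err *= 1.0 - reflection * reflection
    return a

def lpc_residue(frame, order=14):
    """Return the inverse-filter output e[n] = sum_k a[k] * x[n - k] for one frame."""
    n = len(frame)
    # Assumption: a Hamming window is applied before estimating the coefficients.
    windowed = [frame[i] * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i in range(n)]
    a = levinson_durbin(autocorrelation(windowed, order), order)
    # The residue is computed from the unwindowed frame; samples before the
    # start of the frame are treated as zero.
    return [sum(a[k] * frame[i - k] for k in range(order + 1) if i - k >= 0)
            for i in range(n)]
```

For a strongly voiced frame the residue should carry far less energy than the speech frame itself, with most of what remains concentrated near the excitation instants.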

PROCEDURES

Acoustic Measures

Following the completion of a pilot study by Eskenazi (1988), six measures were implemented.

FIGURE 1. Speech (top panel) and residue signals (bottom panel) for a normal male speaker (M1). The time scale is from 0 to 128 ms.


FIGURE 2. Speech (top panel) and residue signals (bottom panel) for a pathological speaker (P13). The time scale is from 0 to 128 ms.

Spectral Flatness of the Residue Signal (SFR). As defined by Markel & Gray (1976), the flatness of the magnitude spectrum (in decibels) is the logarithm of the ratio of the geometric mean of the spectrum to the arithmetic mean of the spectrum. The SFR is a negative quantity, and a large negative number corresponds to a "good voice," whereas a small negative number corresponds to a "bad voice." Davis (1976) and Prosek et al. (1987) suggested that the SFR measures the masking of fundamental frequency harmonics by noise.

Pitch Amplitude (PA). The PA is the maximum amplitude of the normalized autocorrelation function of the residue signal that does not appear at the origin (Rabiner & Schafer, 1978). This maximum peak usually corresponds to the second peak in the normalized autocorrelation function. The first peak is unity amplitude and occurs at the origin. Figure 3 presents examples of PAs for normal and pathological voices. Since the PA represents the degree of voicing, it is expected to be lower for abnormal (e.g., breathy) voices than for normal voices. It will also be low for normal voices phonating unvoiced sounds and voiced fricatives. Prosek et al. (1987) employed the PA in their measurement battery as well.

Harmonics-to-Noise Ratio (HNR). A spectrogram of a normal voice typically shows well-developed harmonics, whereas a hoarse voice will fail to show strong harmonics (Yanagihara, 1967). Based on this observation, Yumoto, Gould, & Baer (1982) derived a new measure, the Harmonics-to-Noise Ratio (HNR), which is the ratio of the acoustic energy of the stable harmonic to that of the noise. The Harmonics-to-Noise Ratio is believed to be well correlated with the index of hoarseness (Yumoto, Gould, & Baer, 1982).
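
The HNR definition above can be illustrated with a short sketch. It assumes one common reading of the Yumoto, Gould, & Baer (1982) procedure, which the text does not spell out: the waveform averaged over consecutive pitch periods is taken as the harmonic component, the deviation of each period from that average is taken as noise, and shorter periods are zero-padded to the longest one. The function name and the input layout (one list of samples per pitch period) are hypothetical.

```python
import math

def hnr_db(periods):
    """Harmonics-to-noise ratio in dB from a list of pitch periods (lists of samples)."""
    n = len(periods)
    length = max(len(p) for p in periods)
    # Assumption: shorter periods are zero-padded to the longest period.
    padded = [list(p) + [0.0] * (length - len(p)) for p in periods]
    # The average waveform across periods approximates the harmonic component.
    avg = [sum(p[t] for p in padded) / n for t in range(length)]
    harmonic = n * sum(v * v for v in avg)
    # Noise energy: squared deviation of every period from the average waveform.
    noise = sum((p[t] - avg[t]) ** 2 for p in padded for t in range(length))
    if noise == 0.0:
        return math.inf  # perfectly repeating periods
    return 10.0 * math.log10(harmonic / noise)
```

Because every period is compared with the same average waveform, both period-length variation (jitter) and amplitude variation (shimmer) end up in the noise term, which lowers the ratio.
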
By its nature, the HNR takes into account the jitter and shimmer present in the signal, which is one of its advantages, because jitter affects the spectrum of a sustained vowel by reducing the amplitudes of the harmonics and introducing noise between them. The HNR is the only measure in this study to be extracted directly from the speech signal.

FIGURE 3. Autocorrelation of the residue signal: (a) normal male speaker (M4), (b) pathological speaker (P22).

Perturbation Measures. Koike et al. (1977) indicated that steady vowel sounds normally exhibit slow and relatively smooth changes in the period and suggested that perturbations be measured from a smoothed trend line when investigating rapid variations in periodicity. This leads to the concept of Perturbation Quotients. The Pitch Perturbation Quotient, or PPQ, is the ratio of a moving average of fundamental period differences to the average fundamental period, where the length of the moving average is equal to five periods. The Amplitude Perturbation Quotient (APQ) is defined as the variation in signal amplitude measured at the fundamental period divided by the mean amplitude. A moving average of five amplitude values is used for this calculation. These perturbation quotients are computed using first the positive peaks of the residue signal, then the negative peaks, with the smallest result being kept (Davis, 1976). The PPQ and APQ represent the amount of deviation from the mean period and the mean amplitude, respectively. Both quantities are expressed in percent. Prosek et al. (1987) also included PPQ and APQ in their study.

Percent Jitter (%JIT). This term is derived by dividing the mean jitter (in ms) by the mean period (in ms) and multiplying by one hundred, where the mean jitter is

\[ \text{mean jitter} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left| P_i - P_{i+1} \right| \]

where P_i is the period of the ith cycle and N is the number of consecutive cycles analyzed. This measure was introduced by Horii (1979, 1980).

Most of the above measures require a precise evaluation of the pitch period, which we measured using the location of the highest peak of the autocorrelation of the residue signal (not counting the peak at the origin). The results of this algorithm were compared to those of another algorithm that measured the pitch period as the difference between the negative peaks of the differentiated electroglottogram (DEGG). The two algorithms gave comparable results (Eskenazi, 1988).

Voice Data Base

The data base of speakers with no vocal disorders (normal) consisted of 25 normal male speakers with an average age of 34 years, ranging from 21-51, and 25 normal female speakers, with an average age of 28 years, ranging from 20-53. None of the speakers had a history of laryngeal disease. One finds it difficult to define a "normal voice." Moore (1971) has pointed out that there are many normal voices based on such characteristics as gender and age. The location of the threshold that separates the normal from the abnormal voice is judged by each listener on the basis of his or her cultural standards, education, environment, vocal training, and similar factors, but wherever the separation between adequate and inadequate is placed, it is obvious that each individual has acquired concepts of normalcy and defectiveness (Moore, 1971). Our definition of normal voice was given by one of the authors (Hicks) as a voice with no apparent pathology (either functional or organic) and no unusual voice characteristics or habits.

Each subject performed a number of vocal tasks, including counting from 1 to 10, singing the musical scale, reading three sentences, and phonating vowels and fricatives. For this study, we examined the data for the vowel /i/, which each subject phonated for approximately 2 s at a comfortable loudness and pitch. The speech and electroglottographic signals were recorded with an Electrovoice type RE-10 dynamic cardioid microphone and a Synchrovoice electroglottograph. The microphone was located 6 in. from the mouth. The signals were directly digitized via a Digital Sound Corporation (DSC) 200 A/D system and stored on a disk for further processing on a VAX 11/750 computer. The speech signal was sharply low-pass filtered at 4800 Hz and sampled at 10 kHz, with each sample being represented by 16 bits. Sampling began before the initiation of the vowel and continued for about 1 s after vowel cessation. All recordings were made in an IAC sound-treated booth.
For this study we examined only the speech data.

The pathologic voice data base consisted of 16 patients with vocal disorders. Two patients were remeasured twice and 3 patients were remeasured once after treatment. A description of the vocal disorders can be found in Table 1. The range of voices varied from mildly deviant to very deviant as determined by one of the authors (Hicks). The patients were asked to phonate the vowel /i/ for about 2 s. The vowel /i/ was chosen for this study because we have found this vowel particularly useful for ultra high-speed laryngeal photography (Childers & Krishnamurthy, 1985; Childers, Naik, Larar, Krishnamurthy, & Moore, 1983). Consequently, we had extensive data for this vowel.

Listening Test

The panel of listeners consisted of seven judges (four males and three females). All judges were faculty in the University of Florida Speech Department familiar with the various voice qualities evaluated in this study. For this listening test, the vowel /i/ was used. A training signal representative of the average quality of the vowel /i/ to be judged was presented before each listening task. Each task signal (token) was 2 s in duration and was presented two times with an interval of 2 s between the two tokens. Prior to the presentation of each pair of test tokens a sinusoidal tone of 2 s was presented to cue the listener that a test token was about to be presented. The tokens were generated by computer via a digital-to-analog converter and were presented via headphones in a professional IAC sound room. In all, six tasks were implemented. There were five tasks for the ratings of the pathological voices: 1) overall severity, 2) hoarseness, 3) breathiness, 4) roughness, and 5) vocal fry; and 6) an additional task to judge the overall excellence of the 50 normal voices.

TABLE 1. Characteristics of pathological speakers.

Patient code   Gender   Symptoms
P1             F        Mild Hoarseness
*P2            F        Hyper Functional
*P3            F        Vocal Fry
P4             F        P3 1 month post injection
P5             F        P3 3 months post injection
*P6            F        Enlarged muscle
P7             M        Breathy; Hoarse
P8             M        P7 1 month post injection
P9             M        Hoarse
P10            M        P9 5 months post injection
*P11           F        Posterior cyst
P12            M        True Vocal Cords (TVC) contact ulcer
*P13           M        Hoarse unilateral TVC carcinoma
P14            M        Vocal Fry
P15            F        Nodules
*P16           F        Right TVC unilateral paralysis
*P17           F        Unilateral paralysis
P18            F        P17 1 month post injection
*P19           F        P17 4 months post injection
P20            F        Breathy, Weak
P21            F        P20 1 month post injection
*P22           M        Bilateral paralysis of TVC
P23            F        Bilateral nodule

*Very deviant voices.

The definitions we used for these six listening tasks were:

Overall severity. Defined as a comprehensive evaluation of the voice taking into account such factors as hoarseness, breathiness, roughness, and vocal fry.

Hoarseness. Defined as breathy plus rough; it is, therefore, a result of a combination of excessive air escapage and an aperiodicity of vocal fold vibration.

Breathiness. Defined as audible escapage of air through the glottis due to insufficient glottal closure; the degree of breathiness severity is inversely proportional to the length of the closed glottal phase.

Roughness. Defined as low-pitched noise, presumably due to irregular vocal fold vibrations.

Vocal fry. Defined perceptually to be a low-pitched, rough-sounding phonation.

Excellence of normal voice. An excellent normal voice was defined as one that appeared to have professional speech training, for example, a voice with good glottal closure, thus providing a good excitation of the vocal tract. The normal voice was defined above.

The scale used for pathological voices was from 1 to 7, with 1 denoting the absence of the quality to be judged (e.g., no breathiness) and 7 denoting an extreme example of the quality to be judged (e.g., extreme breathiness). For normal voices, a value of 1 denoted a poor quality voice whereas a 7 denoted an excellent quality voice. Most previous studies used either a 4-point scale (Moran & Gilbert, 1984), a 5-point scale (Kuwabara & Ohgushi, 1984; Sapozhkov, 1972), a 7-point scale (Prosek et al., 1987), a 9-point scale (Hecker & Kreul, 1971; Kreul & Hecker, 1971), or a 10-point scale (Rothauser, Urbauer, & Pachl, 1971). In our case, due to the relatively wide range of vocal qualities present in the data base, a 7-point scale was appropriate.

In order to test the reliability of each judge, a selected number of tokens, unknown to the judges, were repeated for each task. We then graded each judge on ability to make consistent judgments on the repeated tokens.

Conditions of Analysis

Based on the studies by Davis (1976) and Prosek et al. (1987), we adopted the following conditions of analysis:

Sampling frequency: 10 kHz
Filter order: 14
No preemphasis
Starting point: 384 ms after start of utterance
Length of interval of analysis: 128 ms with no overlap between successive frames.

The vocal intensity was not controlled. Each subject phonated at a comfortable pitch and intensity level. Intensity was not considered a factor because all signals analyzed were approximately the same magnitude after digitization. No recording or post recording amplification adjustments of gain were made.

FIGURE 4. Histogram of the distribution for the 7 scores for overall severity. The coefficient of concordance is .63.

FIGURE 5. Histogram of the distribution for the 7 scores for hoarseness. The coefficient of concordance is .65.

The short interval of analysis resulted in a relatively small number of pitch periods being analyzed. This required further investigation. We examined the influence of the number of periods analyzed on the perturbation measures and the HNR and found that there were variations depending on the number of pitch periods in the data record analyzed (Eskenazi, 1988). We concluded that an interval of 50 periods was the best compromise, which was in agreement with the findings of Yumoto, Gould, & Baer (1982). Hence, for the multiple linear regression analyses between the listening test ratings and the acoustic measures, we analyzed data intervals containing at least 50 pitch periods and updated the inverse filter coefficients every 25.6 ms. For those measures that required a peak picking algorithm, we used parabolic interpolation (Markel & Gray, 1976). This counterbalanced the effect of using a relatively low sampling frequency of 10 kHz (Titze, Horii, & Scherer, 1987). To further confirm our use of the 10 kHz sampling frequency we investigated the effect on our data analysis of using a 100 kHz sampling frequency for three representative patients. The results of our analysis were not affected by the sampling frequency.
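
The peak picking with parabolic interpolation, the PA, and the perturbation measures discussed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the lag search range (`min_lag`, `max_lag`) is a hypothetical parameter introduced to exclude the lobe at the origin, the percent jitter follows the Horii definition (mean absolute difference of consecutive periods), and the five-point perturbation quotient follows the usual Koike/Davis form, which the text describes only in words. All function names are hypothetical.

```python
def normalized_autocorr(x, max_lag):
    """Autocorrelation normalized so that the value at lag 0 is 1 (x must not be all zeros)."""
    n = len(x)
    r0 = sum(v * v for v in x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / r0 for k in range(max_lag + 1)]

def parabolic_peak(y, i):
    """Refine a peak at integer index i with a parabola through y[i-1], y[i], y[i+1]."""
    a, b, c = y[i - 1], y[i], y[i + 1]
    denom = a - 2.0 * b + c
    if denom == 0.0:
        return float(i), b
    shift = 0.5 * (a - c) / denom
    return i + shift, b - 0.25 * (a - c) * shift

def pitch_amplitude(residue, min_lag, max_lag):
    """Return (PA, pitch period in samples) from the residue autocorrelation."""
    r = normalized_autocorr(residue, max_lag + 1)
    best = max(range(min_lag, max_lag + 1), key=lambda k: r[k])
    lag, height = parabolic_peak(r, best)
    return height, lag

def percent_jitter(periods):
    """Mean absolute difference of consecutive periods over the mean period, in percent."""
    diffs = [abs(periods[i] - periods[i + 1]) for i in range(len(periods) - 1)]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def perturbation_quotient(values, span=5):
    """PPQ (periods) or APQ (peak amplitudes): deviation from a moving average, in percent."""
    half = span // 2
    devs = [abs(sum(values[i - half:i + half + 1]) / span - values[i])
            for i in range(half, len(values) - half)]
    return 100.0 * (sum(devs) / len(devs)) / (sum(values) / len(values))
```

At 10 kHz, a search range of roughly 20 to 200 samples would cover fundamental frequencies between 50 and 500 Hz; that range is an illustrative choice rather than a value taken from the paper.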

RESULTS

Analysis of the Listening Test

Recall that the seven judges rated the 50 normal voices on a 1-to-7 scale of overall excellence, and the 23 pathological voices using the same 1-to-7 scale for five voice quality conditions: overall severity, hoarseness, breathiness, roughness, and vocal fry. We evaluated the listeners' agreement by computing a histogram of the distribution of the seven scores for the six listening tasks. In addition, we calculated the coefficient of concordance for each task (Winer, 1971). These data are presented in Figures 4-9. As the coefficients of concordance indicate, the listeners tended to agree for the pathological listening tasks and disagreed on the evaluation of "normal" voices. The histograms tended to be uniformly distributed across the scoring scale for the pathological tasks, whereas the distribution for the normal task tended to be bell shaped with a slight skewing toward the excellence end of the scale.

In addition to the above listener evaluation we graded each judge's consistency for scoring repeated tokens in each listening task. If a particular judge did not achieve at least a 60% consistency score for a given task, then his or her ratings for that task were not included in the calculations of the means. This happened only four times for three different tasks. We believe this indicated a good intrajudge consistency, given the difficulty of the listening tasks.

FIGURE 6. Histogram of the distribution for the 7 scores for breathiness. The coefficient of concordance is .54.

FIGURE 7. Histogram of the distribution for the 7 scores for roughness. The coefficient of concordance is .71.

FIGURE 9. Histogram of the distribution for the 7 scores for the normal speakers. The coefficient of concordance is .26.

Pathological voices. Since we expected that each vocal quality was in fact correlated to more than just one acoustic parameter, the following procedure was adopted for each task: a multiple linear regression analysis was performed, and the acoustic parameters obtained by this regression were recorded. The selection criterion we used was the prediction sum of squares (PRESS) (Allen & Cady, 1982). The acoustic parameters were used as predictors with the quality scale as a criterion. The best PRESS model (i.e., the model with the lowest PRESS value) was adopted. The acoustic correlates of each vocal quality are summarized in Table 2 along with the square of the multiple linear correlations (R2). The corresponding multiple linear regression models are presented in Table 3.

Normal voices. As mentioned earlier, the interjudge reliability was low for normal voices, preventing us from considering the means of the ratings across all judges for the single rating of overall excellence of the voice. We conducted multiple linear regressions for each vocal quality as evaluated by each judge using the acoustic parameters as predictors. The results indicated that the important parameters varied from judge to judge, and no definite trend could be derived from these results. These results suggest that each judge relied on different parameters, and apparently the parameters studied here represent only a small subset of the set of parameters used by the judges. Consequently, the only conclusions we considered were for the pathological voices.
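
The PRESS-based model selection described above for the pathological voices can be sketched as follows. The sketch assumes ordinary least squares with an intercept, refits the model once per left-out observation, and searches every predictor subset up to a maximum order; the paper does not describe its search procedure or software, so these choices and all function names are assumptions.

```python
from itertools import combinations

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

def press(rows, y):
    """Prediction sum of squares: predict each observation from all the others."""
    total = 0.0
    for i in range(len(y)):
        beta = fit(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        pred = beta[0] + sum(b * v for b, v in zip(beta[1:], rows[i]))
        total += (y[i] - pred) ** 2
    return total

def best_press_model(columns, y, max_order=2):
    """Return (PRESS, predictor names) for the subset with the lowest PRESS."""
    best = None
    for order in range(1, max_order + 1):
        for names in combinations(sorted(columns), order):
            rows = [[columns[name][i] for name in names] for i in range(len(y))]
            score = press(rows, y)
            if best is None or score < best[0]:
                best = (score, names)
    return best
```

Here `columns` would map a measure name such as "PA" or "HNR" to its values across voices, and `y` would hold the mean listener ratings for one quality scale.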

DISCUSSION

The selection of the model order in regression analysis is known to involve a "trade-off between including variables that will reduce the residual sum of squares from the use of an insufficient model, and including variables that have little effect on the residual sum of squares but will increase the variance of a predicted value" (Allen & Cady, 1982, p. 254). The prediction sum of squares (PRESS) is recommended as a more realistic criterion than the residual sum of squares (Allen & Cady, 1982). For a particular set of variables PRESS is obtained by predicting each observation using all the other observations. The resulting residuals (PRESS residuals) are squared and summed to form PRESS (Allen & Cady, 1982). The PRESS criterion selects the model with the lowest PRESS value. When we applied the PRESS criterion to our data we found that each vocal quality (severity, hoarseness, breathiness, roughness, and vocal fry) could be modeled by a first or second order model as listed in Table 3. Lower and higher order models had larger prediction sum of squares residuals. We also calculated the R2 as shown in Table 2 for the PRESS condition. When we used the R2 as the model selection criterion (instead of PRESS) we calculated the model orders up to and including six and listed the residual sum of squares for comparison purposes. Although the R2 always increased in this case, the increase was not very great over that for the second order model (or first order model in the case of breathiness) selected by the PRESS criterion.

FIGURE 8. Histogram of the distribution for the 7 scores for vocal fry. The coefficient of concordance is .51.

TABLE 2. Main factors for the prediction of the quality of pathological voices (using means of listening test) with PRESS as the selection criterion.

Quality            Main factors   R2
Overall severity   PA, HNR        0.57
Hoarseness         PA, %JIT       0.56
Breathiness        %JIT           0.30
Roughness          SFR, HNR       0.59
Vocal fry          PA, HNR        0.54

TABLE 3. Models derived from regression analyses for various quality scales with PRESS as the selection criterion.

Quality (Y)        Equation of the model
Overall severity   Y = 6.37 - 4.18 * PA - 0.06 * HNR
Hoarseness         Y = 4.8 - 4.26 * PA + 0.1 * %JIT
Breathiness        Y = 2.27 + 0.19 * %JIT
Roughness          Y = 6.49 + 0.20 * SFR - 0.13 * HNR
Vocal fry          Y = 5.06 - 2.55 * PA - 0.09 * HNR

The results of the regression analyses performed for the various vocal qualities brought out several points. Among the six acoustic parameters considered, the most important was the PA, which was a predominant factor for three of the pathological vocal qualities. The second most important factor was the HNR, which was also a factor for three of the vocal qualities. However, HNR was always a second factor. The parameter %JIT was a predictor for both hoarseness and breathiness. The PPQ and the APQ did not play a role in the judges' evaluation of vocal quality of pathological speakers.

The results obtained contradict previous studies concerning the HNR. It is interesting to note that in our study the HNR did not play a role in the prediction of hoarseness, contrary to other studies (Yumoto et al., 1982; Yumoto et al., 1984). As a result of their research, Yumoto et al. (1982) concluded that pathological voices had an HNR smaller than 7.4 dB. Our results do not support this conclusion; 70% of the pathological voices analyzed had an HNR larger than 7.4 dB, and 19% of the normal voices had an HNR smaller than 7.4 dB. This finding is similar to that given by Moss & Hicks (1985). Also, contrary to the findings of Yumoto et al. (1982), female voices had a larger HNR than male voices, and this difference was statistically significant (Eskenazi, 1988).

The model obtained for the prediction of breathiness was the poorest because the R2 was lower than for any of the other voice qualities. The R2 never exceeded 0.5 even when all six acoustic parameters were included. Furthermore, the PRESS criterion did not vary greatly as other acoustic parameters were added to increase the model order. In other words, %JIT was the best predictor for breathiness according to the PRESS criterion. It appears that breathiness has not been studied as extensively as hoarseness. Klich (1982) noted that the number of discernible harmonics decreased as breathiness increased.
This is not in contradiction with our findings because, unlike Klich (1982), we used speakers with vocal disorders for the study of breathiness, and the effects and relationship of spectral parameters may differ from normal speakers to pathologic ones. Furthermore, HNR was a factor for breathiness in our study if we considered a higher order prediction model. In their research, Wolfe and Steinfatt (1987) found that the fundamental frequency and a jitter-related measure were the only significant predictors of breathiness, with the logarithm of the period standard deviation being the most important one. It has been observed (Hammarberg et al., 1980; Issihiki, Yanagihara, & Morimoto, 1966) that the loss of high frequency components in the overall spectrum was an acoustic correlate of breathiness, which was attributed to an increase in the open phase of the laryngeal vibratory

Downloaded From: http://jslhr.pubs.asha.org/ by a La Trobe Univ User on 02/07/2016 Terms of Use: http://pubs.asha.org/ss/rights_and_permissions.aspx

33

298-306

June 1990

cycle. This explains why the spectrographic classification system (Yanagihara, 1967), based on noise components increasing in intensity in the higher portion of the frequency range, was not very effective for the detection of breathiness. Wolfe and Steinfatt (1987) conjectured that laryngeal irregularities contributing to turbulent airflow in breathiness might be less complex than in the production of a hoarse quality. This would mean that breathy voices are closer to normal voices than are other voice types, and might explain why the prediction of breathiness is almost as difficult as the prediction of the quality of normal voices. In a study of roughness, Imaizumi (1986) concluded that the acoustic correlates of roughness include the multiplicative variations that occur over several pitch periods, as well as those that are synchronous with the vocal pitch period; he also noted that an irregularity in the speech waveform was not essential for perceptual roughness. Our results suggest that pitch perturbation measures as reflected in the HNR and SRF are indeed an important correlate of roughness. The primary goal of this study was to find the acoustic correlates of voices spanning a wide range on the vocal quality continuum. The perceptual evaluation of pathologic voices on five different scales yielded some results that were readily interpretable. On the other hand, the perceptual evaluation of normal voices was inconclusive because each judge apparently used his or her own criteria, possibly using perceptual factors not covered by the set of acoustic measures studied here. Our finding with respect to normal voices agrees with that of Prosek et al. (1987). It is likely that a vocal disorder could affect more than one dimension of the voice simultaneously. This makes the task of estimating specific attributes of the voice using acoustic parameters difficult. 
To be more conclusive, more information concerning the effects of specific pathologies on vocal fold vibration is needed, as is a more comprehensive study of larger speech samples. Furthermore, a larger set of subjects should be used so that the statistics would be more meaningful. Despite these weaknesses, the results obtained in this study for pathological voices were sufficient to derive models that predict the various vocal qualities with a reasonable level of reliability. One weakness of our study, and of others, is that the acoustic measures used as predictors of quality appear to represent only a fraction of the set of all measures used by a human listener. However, our results suggest that the set of acoustic measures chosen for this study was adequate. The values of R² obtained for the various regressions were slightly lower than those obtained by Prosek et al. (1987) in a similar study, but this was because, by using multiple linear regression with PRESS as the selection criterion, we used fewer acoustic parameters as predictors. Our conclusions for the patient data are similar to those of Prosek et al. (1987, p. 115), who suggested that "it may be possible to develop a vocal index of degree of impairment based on these (their) measurements." These authors also suggested (p. 115) that their results "indicate how difficult it will be to develop an acoustically based vocal index that includes a quantification of specific voice qualities." Our results for the normative data also agree with this latter observation.
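As an illustration of the selection criterion mentioned above, the following is a minimal sketch of how a PRESS (prediction sum of squares) statistic can be computed for an ordinary least-squares fit and used to choose a predictor subset of order two or less. It is not the software used in this study: the function names `press_statistic` and `best_subset_by_press` are hypothetical, and the sketch relies on the standard leave-one-out identity in which the deleted residual equals the ordinary residual divided by one minus the leverage.

```python
import itertools

import numpy as np


def press_statistic(X, y):
    """PRESS of an ordinary least-squares fit of y on X (intercept added here).

    X: (n, p) array of acoustic measures; y: (n,) array of listener ratings.
    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with an intercept column.
    A = np.column_stack([np.ones(len(y)), X])
    # Least-squares coefficients and ordinary residuals.
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    # Leverages: diagonal of the hat matrix A @ pinv(A).
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    # Leave-one-out (deleted) residual for case i is e_i / (1 - h_ii).
    return float(np.sum((residuals / (1.0 - leverage)) ** 2))


def best_subset_by_press(X, y, names, max_order=2):
    """Try every predictor subset of size <= max_order; return (PRESS, names)."""
    X = np.asarray(X, dtype=float)
    best = None
    for k in range(1, max_order + 1):
        for cols in itertools.combinations(range(X.shape[1]), k):
            score = press_statistic(X[:, list(cols)], y)
            if best is None or score < best[0]:
                best = (score, tuple(names[c] for c in cols))
    return best
```

Because PRESS penalizes models that fit the sample well but predict held-out cases poorly, a search of this kind tends to favor small predictor sets, which is consistent with the observation above that fewer acoustic parameters were retained.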

The main conclusion of our study was that the two best predictors for vocal quality were Pitch Amplitude (PA) and Harmonics-to-Noise Ratio. This result is consistent with the findings of Prosek et al. (1987). The results concerning pathological voices were:

1. overall quality was characterized by a low PA and a low HNR,
2. a hoarse voice exhibited a low PA and a high %JIT,
3. a breathy voice was characterized by high %JIT,
4. a rough voice was characterized by a low SFR and a low HNR,
5. vocal fry was characterized by a low PA and a low HNR.

This study showed that various patient vocal qualities can be predicted with some acoustic measures, and that the degree of reliability of this prediction may be improved by adding other efficient and independent acoustic measures to the list of predictors.

ACKNOWLEDGMENTS

This research was primarily supported by NIH grants NIDCD R01 DC00577 and NINCDS R01 NS17078 with additional support from the University of Florida Center of Excellence Program in Information Transfer and Processing, and the Mind-Machine Interaction Research Center.

REFERENCES

ALLEN, D. M., & CADY, F. B. (1982). Analyzing experimental data by regression. Belmont, CA: Wadsworth.
CHILDERS, D. G., & KRISHNAMURTHY, A. K. (1985). A critical review of electroglottography. CRC Critical Reviews in Biomedical Engineering, 12, 131-161.
CHILDERS, D. G., NAIK, J. M., LARAR, J. N., KRISHNAMURTHY, A. K., & MOORE, G. P. (1983). Electroglottography, speech and ultra-high speed cinematography. In I. R. Titze & R. Scherer (Eds.), Vocal fold physiology and biophysics of voice (pp. 202-220). Denver, CO: Denver Center for the Performing Arts.
CHILDERS, D. G., & WU, K. (1990). Quality of speech produced by analysis-synthesis. Speech Communication, 9(1).
COLTON, R. H., & ESTILL, J. A. (1981). Elements of voice quality: Perceptual, acoustic, and physiologic aspects. Speech and Language, 5, 311-403.
DAVIS, B. S. (1976). Computer evaluation of laryngeal pathology based on inverse filtering of speech. Speech Communication Research Laboratory, Monograph Number 13, Santa Barbara, CA.
ESKENAZI, L. (1988). Acoustic correlates of voice quality and distortion measure for speech processing. Unpublished doctoral dissertation, University of Florida, Gainesville.
FUKAZAWA, T., EL-ASSUOOFY, A., & HONJO, J. (1988). A new index for evaluation of the turbulent noise in pathological voice. Journal of the Acoustical Society of America, 83(3), 1189-1193.
HAMMARBERG, B., FRITZELL, B., GAUFFIN, J., SUNDBERG, J., & WEDIN, L. (1980). Perceptual and acoustic correlates of abnormal voice qualities. Acta Otolaryngology, 90, 441-451.
HECKER, M. H., & KREUL, E. J. (1971). Descriptions of speech of patients with cancer of the vocal folds. Part I: Measures of fundamental frequency. Journal of the Acoustical Society of America, 49, 1275-1282.
HILLENBRAND, J. (1987). A methodological study of perturbation and additive noise in synthetically generated voice signals. Journal of Speech and Hearing Research, 30, 448-461.
HILLMAN, R. E., & WEINBERG, B. (1981). Estimation of volume velocity waveform properties. A review and study of some methodological assumptions. In N. Lass (Ed.), Speech and language: Advances in basic research and practice (pp. 411-473). New York: Academic Press.
HOLLIEN, H., MICHEL, J., & DOHERTY, E. T. (1973). A method for analyzing vocal jitter in sustained phonation. Journal of Phonetics, 1, 85-91.
HORII, Y. (1979). Fundamental frequency perturbation observed in sustained phonation. Journal of Speech and Hearing Research, 22, 5-19.
HORII, Y. (1980). Vocal shimmer in sustained phonation. Journal of Speech and Hearing Research, 23, 202-209.
IMAIZUMI, S. (1986). Acoustic measures of roughness in pathological voice. Journal of Phonetics, 14, 457-462.
ISSHIKI, N., YANAGIHARA, N., & MORIMOTO, M. (1966). Approach to the objective diagnosis of hoarseness. Folia Phoniatrica, 18(6), 393-400.
JAVKIN, H. R., ANTONANZAS-BARROSO, N., & MADDIESON, I. (1987). Digital inverse filtering for linguistic research. Journal of Speech and Hearing Research, 30, 122-129.
KASUYA, H., KOBAYASHI, Y., & KOBAYASHI, T. (1983). Characteristics of pitch period and amplitude perturbations in pathological voice. Proceedings of the International IEEE Conference on Acoustics, Speech, and Signal Processing. New York, NY: Institute of Electrical and Electronic Engineers, 1372-1375.
KASUYA, H., MASUBUCHI, K., EBIHARA, S., & YOSHIDA, H. (1986). Preliminary experiments on voice screening. Journal of Phonetics, 14, 463-468.
KASUYA, H., OGAWA, S., KIKUSHI, Y., & EBIHARA, S. (1986). An acoustic analysis of pathological voice and its application to the evaluation of laryngeal pathology. Speech Communication, 5, 171-181.
KITAJIMA, K. (1981). Quantitative evaluation of the noise level in the pathological voice. Folia Phoniatrica, 33, 115-124.
KLATT, D. H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82, 737-793.
KLICH, R. J. (1982). Relationships of vowel characteristics to listener ratings of breathiness. Journal of Speech and Hearing Research, 25, 574-580.
KLINHOLZ, F., & MARTIN, F. (1985). Quantitative spectral evaluation of shimmer and jitter. Journal of Speech and Hearing Research, 28, 169-174.
KOIKE, Y., & MARKEL, J. (1975). Application of inverse filtering for detecting laryngeal pathology. Annals of Otology, Rhinology and Laryngology, 84(1), 117-124.
KOIKE, Y., TAKAHASHI, H., & CALCATERRA, T. C. (1977). Acoustic measures for detecting laryngeal pathology. Acta Otolaryngologica, 84, 105-117.
KREUL, E. J., & HECKER, M. H. L. (1971). Description of the speech of patients with cancer of the vocal folds. Part II: Judgements of age and voice quality. Journal of the Acoustical Society of America, 49, 1283-1287.
KUWABARA, H., & OHGUSHI, K. (1984). Experiments of voice quality of vowels in males and females and correlation with acoustic features. Language and Speech, 27, 135-145.
LAVER, J. (1980). The phonetic description of voice quality. Cambridge: Cambridge University Press.
LIEBERMAN, P. (1961). Perturbation in vocal pitch. Journal of the Acoustical Society of America, 33, 597-603.
LIEBERMAN, P. (1963). Some acoustic measures of the fundamental periodicity of normal and pathological larynges. Journal of the Acoustical Society of America, 35, 344-353.
LUDLOW, C. L., BASSICH, C. J., CONNOR, N. P., COULTER, D. C., & LEE, Y. J. (1987). The validity of using phonatory jitter and shimmer to detect laryngeal pathology. In T. Baer, C. Sasaki, & K. Harris (Eds.), Laryngeal function in phonation and respiration (pp. 492-508). Boston, MA: College-Hill Press.
MARKEL, J. D., & GRAY, A. H. (1976). Linear prediction of speech. New York: Springer-Verlag.
MILENKOVIC, P. (1987). Least mean square measures of voice perturbation. Journal of Speech and Hearing Research, 30, 529-538.
MORAN, M. J., & GILBERT, H. R. (1984). Relation between voice profile ratings and aerodynamic and acoustic parameters. Journal of Communication Disorders, 17, 245-260.
MOORE, G. P. (1971). Organic voice disorders. Englewood Cliffs, NJ: Prentice-Hall.
MOSS, S. E., & HICKS, D. M. (November 1985). Acoustical analysis of three phonatory behaviors in adducted spastic dysphonia. Papers presented at the American Speech-Language-Hearing Association Convention, Washington, DC.
MURRY, T., & SINGH, S. (1980). Multidimensional analysis of male and female voices. Journal of the Acoustical Society of America, 68(5), 1294-1300.
MUTA, H., MURAOKA, T., WAGATSUMA, K., FUKUDA, H., TAKAYAMA, E., FUJIOKA, T., & KANOU, S. (1987). Analysis of hoarse voices using the LPC Method. In T. Baer, C. Sasaki, & K. Harris (Eds.), Laryngeal function in phonation and respiration (pp. 463-474). Boston, MA: College-Hill Press.
PINTO, N. B., CHILDERS, D. G., & LALWANI, A. (1989). Formant speech synthesis: Improving production quality. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(12), 1870-1887.
PROSEK, A. R., MONTGOMERY, B. E., & HAWKINS, D. B. (1987). An evaluation of residue features as correlates of voice disorders. Journal of Communication Disorders, 20, 105-117.
QUACKENBUSH, S. R., BARNWELL, T. P., & CLEMENTS, M. A. (1988). Objective measures of speech quality. Englewood Cliffs, NJ: Prentice-Hall.
RABINER, L. R., & SCHAFER, R. W. (1978). Digital processing of speech signals. Englewood Cliffs, NJ: Prentice-Hall.
ROTHAUSER, E. H., URBAUER, G. E., & PACHL, W. P. (1971). A comparison of preference measurement methods. Journal of the Acoustical Society of America, 49, 1291-1308.
ROTHENBERG, M. R. (1973). A new inverse-filtering technique for deriving the glottal air flow waveform during voicing. Journal of the Acoustical Society of America, 53, 1632-1645.
ROTHENBERG, M. R., & ZAHORIAN, S. (1977). Nonlinear inverse filtering technique for estimating the glottal-area waveform. Journal of the Acoustical Society of America, 61(4), 1063-1071.
SANSONE, F. E., JR., & EMANUEL, F. W. (1970). Spectral noise levels and roughness severity ratings for normal and simulated rough vowels produced by adult males. Journal of Speech and Hearing Research, 13, 489-502.
SAPOZHKOV, M. A. (1972). Method of investigative and rating speed-transmission equipment. Society of Physics-Acoustics, 18, 80-83.
SINGH, S., & MURRY, T. (1978). Multidimensional classification of normal voice qualities. Journal of the Acoustical Society of America, 64(1), 81-87.
SMITH, B. E., WEINBERG, B., FETH, L. F., & HORII, Y. (1978). Vocal roughness and jitter characteristics of vowels produced by esophageal speakers. Journal of Speech and Hearing Research, 21, 240-249.
SORENSEN, B. E., & HORII, Y. (1984). Directional perturbation factors for jitter and shimmer. Journal of Communication Disorders, 17, 143-151.
TITZE, I. R., HORII, Y., & SCHERER, R. C. (1987). Some technical considerations in voice perturbation measurements. Journal of Speech and Hearing Research, 30, 252-260.
VOIERS, W. D. (1977). Diagnostic acceptability measure for speech communications systems. Proceedings of the 1977 International IEEE Conference on Acoustics, Speech, and Signal Processing. New York: Institute of Electrical and Electronic Engineers, 204-207.
WENDAHL, R. W. (1966). Laryngeal analog synthesis of jitter and shimmer-Auditory parameters of harshness. Folia Phoniatrica, 18, 98-108.
WINER, B. J. (1971). Statistical principles in experimental design. New York: McGraw-Hill.
WOLFE, V. I., & STEINFATT, T. M. (1987). Prediction of vocal severity within and across voice types. Journal of Speech and Hearing Research, 30, 230-240.
YANAGIHARA, N. (1967). Significance of harmonic changes and noise components in hoarseness. Journal of Speech and Hearing Research, 10, 531-541.
YUMOTO, E., GOULD, W. J., & BAER, T. (1982). Harmonics-to-noise ratio as an index of the degree of hoarseness. Journal of the Acoustical Society of America, 71, 1544-1550.
YUMOTO, E., SASAKI, Y., & OKAMURA, H. (1984). Harmonics-to-noise ratio and psychophysical measurement of the degree of hoarseness. Journal of Speech and Hearing Research, 27, 2-6.

Received December 13, 1988
Accepted December 11, 1989

Requests for reprints should be sent to D. G. Childers, University of Florida, Mind-Machine Interaction Research Center, Department of Electrical Engineering, Gainesville, FL 32611.
