SOME STATISTICAL CHARACTERISTICS OF VOICE FUNDAMENTAL FREQUENCY

YOSHIYUKI HORII
Purdue University, West Lafayette, Indiana

Two experiments are reported in which the magnitude of sampling errors associated with estimates of the mean, median, and standard deviation of voice fundamental frequencies (fo) during oral reading is investigated as a function of sample size. In one experiment, voices are sampled with fixed time windows. In the other experiment, results of fo analysis are compared for single-sentence voice samples and paragraph voice samples. Overall shape of fo distributions as well as interrelationships among various distributional measures are discussed.

In studies of the fundamental frequency (fo) of normal and pathological voices, measures such as the mean, median, standard deviation, and mid-90% range are often used to characterize data. The length of the voice samples investigated often differs from one study to another, however, raising a question as to the magnitude of sampling error associated with the various measures. In addition, interpretation and comparison of these data sometimes pose difficulty due to the particular choice of a statistic employed by an investigator.

The purpose of the present study was to investigate some statistical characteristics of voice fundamental frequency during oral reading. In particular, the magnitudes of errors associated with estimates of the mean, median, and standard deviation of fo distributions as a function of sample size are investigated using two different sampling methods. For the first experiment, the voice is sampled with fixed time windows and thus without regard to the linguistic units such as words or sentences. In the second experiment, the sentence is used as the unit of sampling. Interrelationships among various distributional measures are also studied for data obtained from the oral reading of a six-sentence paragraph.
All fo data were obtained by a computer program that used a peak-picking method (similar to that described by Gold, 1962) operating on a speech waveform sampled and digitized 10,000 times per second. In essence, the program uses second-order maxima of the amplitude of the waveform as primary candidates for the peak in a period. Amplitude and period information of preceding segments are also used to decide which candidate should be tentatively accepted as the "pitch" peak. After all tentative decisions are made, good strings and bad strings of period values are labeled, a period being considered good if the value is within 6% of immediately preceding period values. The program then tries to expand good strings backward and forward by looking for better candidates of the peaks in the waveform data. As output, the program provides a conventional melody plot, distribution histograms, and statistical data such as the mean, median, standard deviation, mid-90% range, and skewness of the fundamental frequency distributions.

The performance of the fo analysis program was tested by processing an utterance of "The rainbow is a division of white light into many beautiful colors," and then comparing the results with those obtained independently via period-by-period hand measurements of the same utterance from oscillographic tracings. For this material, the overall root-mean-square difference between the two sets of data was 0.9 Hz, and the difference between the mean fundamental frequencies calculated by the program and by the hand measurement was 0.7 Hz. Existence of periodicity (or voicing) was properly detected by the program about 96% of the time without false alarm; that is, out of 317 fundamental oscillations determined by hand measurements of an oscillographic tracing of the utterance, 304 oscillations were identified by the computer program. Most omissions or undetected voicing portions were at transitional segments of low-amplitude voiced consonants.

Reliability of the computer procedure was also tested by redigitizing and reanalyzing, 13 times, an audio recording of the first paragraph of "The Rainbow Passage" (Fairbanks, 1960) as read by an adult male. The range of the mean fundamental frequencies obtained by the repeated processings was less than 0.5 Hz and the standard deviation of the 13 fo means was 0.11 Hz, indicating high repeatability.

Downloaded From: http://jslhr.pubs.asha.org/ by a Northern Illinois University User on 08/28/2016 Terms of Use: http://pubs.asha.org/ss/rights_and_permissions.aspx

EXPERIMENT I

Subjects, Speech Materials, and Analysis Procedures

High-quality magnetic tape recordings were made during oral reading of the entire Rainbow Passage by 10 adults, six males and four females, who had no history of speech, hearing, or reading problems. The passage consisted of 331 words and required about two minutes to read. Subjects were instructed to familiarize themselves with the passage and to read it at a comfortable vocal intensity. The recorded speech samples were digitized and stored on computer magnetic tape.

For subsequent analysis, the total utterance for each talker was divided into 3.6-second segments. These were used as the smallest units of sample size. The segment length of 3.6 seconds was related to the data-buffer size of the program and happened to be about the duration of short sentences. The means, medians, standard deviations, and mid-90% ranges of the fundamental frequencies were calculated for each segment and for various sizes of samples that were multiples of the 3.6-second segments. Following the practice of most fo investigators, measures of central tendency were expressed in Hertz while measures of variability were expressed in semitones. In selecting samples of larger sizes, a restriction was placed so that any one sample consisted of a series of contiguous 3.6-second segments and not of randomly selected segments. This procedure was considered reasonable since, in practice, only one portion of an utterance is usually subjected to fo analysis. Sample sizes used for this experiment were of 3.6-, 7.2-, 14.4-, 28.8-, 57.6-, 72.0-, and 90.0-second duration.
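As a rough illustration of the per-sample measures described above, the following Python sketch computes the mean and median in Hertz and the standard deviation and mid-90% range in semitones for one sample, and builds larger samples from contiguous segments. It is not the analysis program used in the study. Several details are assumptions made only for illustration: the semitone reference of 16.35 Hz (the zero frequency level the paper gives for semitone conversion), the divide-by-N standard deviation, the reading of the mid-90% range as the spread between the 5th and 95th percentiles, the use of overlapping runs of contiguous segments, the example values, and all function names.

```python
import math
import statistics

REF_HZ = 16.35  # semitone reference; the paper's "zero frequency level"


def hz_to_semitones(f_hz, ref_hz=REF_HZ):
    """Convert a frequency in Hz to semitones above ref_hz."""
    return 12.0 * math.log2(f_hz / ref_hz)


def percentile(sorted_vals, p):
    """Percentile by linear interpolation between order statistics (assumed method)."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def describe(f0_hz):
    """Summary measures for one voice sample given as a list of fo values in Hz."""
    st = sorted(hz_to_semitones(f) for f in f0_hz)
    return {
        "mean_hz": statistics.fmean(f0_hz),
        "median_hz": statistics.median(f0_hz),
        "sd_st": statistics.pstdev(st),  # divide-by-N form, an assumption
        "mid90_st": percentile(st, 95) - percentile(st, 5),
    }


def contiguous_samples(segments, k):
    """Every run of k neighbouring segments, each merged into one list of fo values."""
    return [
        [f for seg in segments[i:i + k] for f in seg]
        for i in range(len(segments) - k + 1)
    ]


# Hypothetical fo values (Hz) for three consecutive 3.6-second segments.
segments = [[110.0, 118.0, 104.0], [121.0, 99.0, 108.0], [115.0, 102.0, 97.0]]
for sample in contiguous_samples(segments, 2):
    print(describe(sample))
```

The standard deviation in semitones does not depend on the choice of reference frequency, because changing the reference only shifts every semitone value by the same constant.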

Results and Discussion

Distributions of fo means and medians as a function of sample size are shown in Figure 1 for subject JP. The abscissa is sample size in seconds and the ordinate represents fundamental frequency in Hertz. Means are represented by crosses and medians by circles. The standard deviations of the means and medians are indicated in the figure by the solid and dashed lines, respectively. As might be expected from statistical sampling theory, the variability of the means and medians decreases as the sample size increases. The variability of the means and medians is roughly the same. The medians converge to a value lower than that to which the means do, indicating that the distribution of individual voice fundamental frequencies is positively skewed.

Figure 1. Mean (crosses) and median (circles) fundamental frequencies as a function of sample size for oral reading of the Rainbow Passage by Subject JP. Variability of mean and median fundamental frequencies is indicated by solid and dashed lines that correspond to twice the standard deviations of the means and medians, respectively.

The size of standard deviations of the fo means as a function of sample size indicated by the solid lines for talker JP in Figure 1 is replotted in Figure 2, together with the data for the remainder of the subjects. The abscissa is again sample size in seconds and the ordinate represents standard deviation of the fo means in Hz. Solid and dashed lines represent data for the male and female subjects, respectively. Individual differences are rather large although, for the subjects employed, there are no systematic differences between the male and female talkers in the convergence behavior of the standard deviation of the means. The standard deviation of the means decreases on the average from 7 Hz at a sample size of 3.6 seconds to 1 Hz at a sample size of about one minute. This is a rather slow convergence compared to that predicted by sampling theory, probably due to the fact that the samples are taken sequentially and are not independent. Examination of Figure 2 also reveals that, except for subject CR, the rate of reduction in the standard deviation of fo means is faster initially and then slows as the sample size is increased.

Figure 2. Standard deviation of mean fundamental frequencies as a function of sample size. Solid lines are for male subjects and dashed lines are for female subjects.

Variations of measures of variability, namely, standard deviations of fundamental frequencies within each sample, were also investigated as a function of sample size. Examples of these variations are shown for five subjects in Figure 3. In the figure, mean standard deviations (in semitones) are plotted as a function of sample size (in seconds), with vertical arrows indicating the standard deviation of the sample standard deviations. Solid lines connecting circles are data for male subjects and dashed lines connecting crosses are data for female subjects. The data for the remaining five subjects fall between Subjects AH and PB and are not shown. As can be seen, the variability of the sample standard deviations decreases as sample size increases. The means of the standard deviations for short samples slightly underestimate the standard deviation for the entire passage. The measure of mid-90% range was less stable than the measure of standard deviation although the two measures were highly correlated.

Figure 3. Mean standard deviation of fundamental frequencies as a function of sample size. The abscissa is sample size in seconds, and the ordinate represents mean standard deviation in semitones. Arrows indicate the magnitudes of standard deviations of the sample standard deviations.

The data presented to this juncture provide some clues as to the magnitude of errors involved in estimates of various fo measures. These pertain to a given size of speech sample when the sampling is done with fixed time windows and without regard to the linguistic units such as words and sentences. Experiment II dealt with the sampling of voices using a linguistic unit, namely, a sentence.

Journal of Speech and Hearing Research 18 192-201 1975

EXPERIMENT II

Subjects, Speech Materials, and Analysis Procedures

High-quality recordings were made of the first paragraph of the Rainbow Passage read at a comfortable vocal intensity by 65 adult males ranging in age from 26 to 79 years, with a mean age of 54.1 years. The paragraph consisted of six sentences with a total of 98 words. For each of the 65 talkers, means, standard deviations, and mid-90% ranges were obtained for the entire paragraph and for each of the six sentences in the paragraph. Pearson product-moment correlation coefficients were calculated for each of the fo measures between the paragraph data and sentence data. Furthermore, the magnitudes of standard errors in estimating these fo measures from single-sentence voice samples were calculated for each sentence using linear regression equations.
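The two quantities named above, the Pearson product-moment correlation and the standard error of estimate from a linear regression, can be sketched in Python as follows. This is only an illustration and not the code used in the study: the function names and the example values are hypothetical, and the use of N - 2 degrees of freedom for the standard error is an assumption, since the paper does not state which form was used.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between paired lists x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standard_error_of_estimate(x, y):
    """Residual scatter about the least-squares line predicting y from x.

    Uses N - 2 degrees of freedom (an assumption; needs at least 3 pairs).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    # Sum of squared residuals about the fitted line.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return math.sqrt(sse / (n - 2))


# Hypothetical mean fo values (Hz): one sentence versus the whole paragraph.
sentence_means = [98.0, 105.0, 112.0, 120.0, 131.0, 140.0]
paragraph_means = [100.0, 104.0, 115.0, 119.0, 133.0, 138.0]
print(round(pearson_r(sentence_means, paragraph_means), 3))
print(round(standard_error_of_estimate(sentence_means, paragraph_means), 2))
```

For a fitted line, the standard error of estimate is approximately the standard deviation of the predicted variable multiplied by sqrt(1 - r^2), so a high correlation goes together with a small standard error.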


Results and Discussion

General Results of fo Analysis. General results of fo analysis for the paragraph readings by the 65 adult male talkers are summarized in Table 1. Means, standard deviations, and ranges are shown for measures of central tendency (mean and median), dispersion (standard deviation and mid-90% range), and the skewness of the fo distributions. Skewness (Sk) was defined as the ratio of the third moment to the cube of the standard deviation and calculated by the formula

$$Sk = \frac{\sum (x - \bar{x})^3 / N}{\left[\sum (x - \bar{x})^2 / N\right]^{3/2}}.$$

TABLE 1. Mean, standard deviation (SD), and range of various fo measures for the first paragraph of the Rainbow Passage read by 65 adult male subjects. Hz = Hertz, ST = semitones, Sk = skewness.

Measure         Mean     SD      Range
Mean (Hz)       112.5    17.3    84 to 151
Median (Hz)     110.7    17.1    82 to 150
SD (ST)         2.41     0.48    1.46 to 3.54
Mid-90% (ST)    7.95     1.65    4.60 to 11.83
Sk (Hz)         0.40     0.53    -0.70 to 3.27
Sk (ST)         -0.12    0.43    -1.12 to 0.76

The mean of the mean fundamental frequencies was 112.5 Hz, while the mean of the standard deviations was 2.41 semitones for the 65 talkers. Average median fundamental frequency was 110.7 Hz, slightly less than the average mean fundamental frequency (by about 2 Hz). Results of skewness analysis showed that on the average the distribution was positively skewed for the fo's expressed in Hertz (Sk = 0.4) while the distribution of fo's expressed in semitones re zero frequency level, that is, 16.35 Hz, showed slight negative skewness (Sk = -0.12). There were, however, individual differences in skewness characteristics as indicated by the range of Sk's covering both negative and positive values. Zemlin (1968) states that the distribution of fo's is negatively skewed while Cowan (1936), after an extensive investigation of stage speech, concludes that fo's are more or less normally distributed. When the present data were converted to vocal period distributions, the skewness values ranged from -0.38 to 3.14 with a mean of 0.69. The values are in general agreement with the data reported by Mikheev (1971) who investigated vocal period distributions of one-sided telephone conversations by six male and five female Russian speakers.

Interrelationships among various distributional measures are summarized in Table 2, which presents Pearson product-moment correlation coefficients between the measures. The coefficient between the means and medians is 0.995 while that for the standard deviations and the mid-90% ranges is 0.982. A slight negative correlation is indicated between the mean fo's and skewness.

TABLE 2. Correlations between various fo measures for the first paragraph of the Rainbow Passage read by 65 adult male subjects. Hz = Hertz, ST = semitones, Sk = skewness.

                Mean (Hz)   Median (Hz)   SD (ST)   Mid-90% (ST)   Sk (Hz)
Median (Hz)     0.995
SD (ST)         0.136       0.103
Mid-90% (ST)    0.107       0.076         0.982
Sk (Hz)         -0.201      -0.257        0.180     0.188
Sk (ST)         -0.177      -0.246        0.135     0.153          0.784

Single-Sentence fo Analysis. The validity of assuming that various fo measures obtained from single-sentence analyses are representative of general fo characteristics for the paragraph readings was investigated by means of correlational analysis. Figure 4 presents a typical result of such analysis, and shows the mean fo of the paragraph voice samples (ordinate) plotted against the mean fo of the second-sentence voice samples (abscissa) for the 65 talkers. For these data, the Pearson product-moment correlation coefficient is 0.985 and the standard error in estimating paragraph fo means from the second-sentence fo means was found to be 3.0 Hz, as indicated in the figure. In other words, given an fo mean for the second-sentence voice sample, the corresponding fo mean for the entire paragraph can be estimated within ±3.0 Hz two-thirds of the time.

Figure 4. Distribution of mean fo's of the paragraph voice samples (ordinate) plotted against mean fo's of the second-sentence voice samples (abscissa) for oral reading of the Rainbow Passage by 65 male adults. Magnitudes of correlation and standard error are also indicated.

A summary of the analysis for each sentence is given in Table 3, which shows Pearson product-moment correlation coefficients calculated for means, medians, standard deviations, and mid-90% ranges between each of the six sentences and the entire paragraph voice samples with associated standard errors of estimates indicated in parentheses. Sentence lengths are also given in the sixth column of the table.

TABLE 3. Correlation of various fo measures between single-sentence and paragraph voice samples (subjects = 65). Standard errors are in parentheses. Sample sizes in number of words and in seconds are also indicated. Hz = Hertz, ST = semitones.

Sentence   Means (Hz)      Medians (Hz)    SD (ST)         Mid-90% (ST)    Sample Size (average duration)
First      0.978 (3.64)    0.983 (3.20)    0.833 (0.27)    0.813 (0.98)    17 words (7 sec)
Second     0.985 (3.04)    0.988 (2.69)    0.820 (0.28)    0.747 (1.11)    12 words (5 sec)
Third      0.992 (2.23)    0.991 (2.34)    0.924 (0.19)    0.899 (0.73)    22 words (8 sec)
Fourth     0.987 (2.82)    0.984 (3.07)    0.823 (0.28)    0.813 (0.98)    13 words (5 sec)
Fifth      0.956 (5.13)    0.950 (5.40)    0.659 (0.37)    0.607 (1.33)    8 words (3 sec)
Sixth      0.989 (2.60)    0.899 (3.13)    0.989 (0.22)    0.849 (0.89)    26 words (9 sec)
Average    0.981 (3.24)    0.966 (3.64)    0.841 (0.27)    0.786 (1.00)    16 words (6 sec)

In general, measures of central tendency (means and medians) had higher correlations than measures of variability (standard deviations and mid-90% ranges) between single-sentence and paragraph measurements. Average Pearson product-moment correlation coefficients for the means and medians were 0.981 and 0.966, respectively, with associated standard errors of 3.24 and 3.64 Hz (approximately 3% of the total mean or median), while coefficients for the standard deviations and mid-90% ranges were 0.841 and 0.786, respectively, with associated standard errors of 0.27 and 1.00 semitones (approximately 12% of the average standard deviation or mid-90% range). Correlations tended to be lower for the initial and final sentences than for the middle sentences, and longer sentences yielded higher correlation coefficients and smaller standard errors of estimates. The value of the coefficient between the mean fo of the second sentence and the mean fo of the paragraph, that is, r = 0.985, agrees with the r = 0.987 obtained by Huntington (see footnote 1), who used the same script read by 13 adult females. High correlations are to be expected since the variables under study were not independent but rather one set of data (sentence data) was actually a subset of the other (paragraph data). The longer the sentences, therefore, the higher the correlations to be expected.

Commonly, when the first paragraph of the Rainbow Passage is used for fo studies, either the second or fourth sentence is selected for analysis in an attempt to avoid possible initial or final sentence effects, and also to limit the sample to sentences with sufficient but manageable length of duration, that is, neither too short nor too long. The results of this experiment support such choice for the study of central tendency of fundamental frequency distributions if just one sentence is to be selected. For the study of fo variability, however, use of single sentences should be avoided since comparatively low correlations and large standard errors of estimates were demonstrated.

Footnote 1. D. Huntington, personal communication.

The effect of the two types of sampling, namely, one by linguistic unit (sentence) and the other by fixed time windows, on the magnitude of sampling errors can be inferred by comparing the results of the two experiments. In the present experiment, as shown in Table 3, an average standard error of 3.24 Hz was obtained in estimating paragraph mean fundamental frequency from mean fo of single-sentence voice samples that had an average duration of six seconds. Referring to Figure 2, it can be seen that at a sample size of six seconds, the average standard error is about 5 Hz. To achieve an accuracy of about 3.2 Hz, it is necessary to analyze on the average about 14 seconds of speech. That is, by selecting the same sentence for all the talkers rather than the voice samples of equivalent duration without respect to linguistic contents, the magnitude of errors in estimating passage fo means is significantly reduced. This reduction of error is probably due to the fact that an overall fo contour associated with a given sentence (which usually comprises a breath group or multiples of breath groups) in a given context is relatively stable. The mean fo for that sentence thus holds a relatively constant relation to the mean fo for the paragraph read by a given subject. With longer sample sizes that include several sentences or more, however, the difference in the magnitude of errors associated with the two sampling procedures is expected to diminish. The statistical estimates were made for a particular oral reading of a particular text.
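The skewness ratio reported in Table 1 (third central moment divided by the cube of the standard deviation) can be sketched in Python as below, together with the Hertz-to-semitone conversion re 16.35 Hz mentioned in the text. This is an illustrative sketch only: the function names and the example values are hypothetical and are not data from the study.

```python
import math


def skewness(values):
    """Third central moment divided by the cube of the standard deviation.

    Both moments use the divide-by-N form given in the paper's formula.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0.0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


def hz_to_semitones(f_hz, ref_hz=16.35):
    """Convert a frequency in Hz to semitones above ref_hz."""
    return 12.0 * math.log2(f_hz / ref_hz)


# Hypothetical fo values spaced one semitone apart around 100 Hz.
f0_hz = [100.0 * 2.0 ** (k / 12.0) for k in (-2, -1, 0, 1, 2)]
f0_st = [hz_to_semitones(f) for f in f0_hz]

print(skewness(f0_hz))  # positive: equal semitone steps are wider in Hz at the top
print(skewness(f0_st))  # essentially zero: the values are symmetric in semitones
```

Because the semitone scale is logarithmic, it compresses the upper tail of a distribution relative to the Hertz scale, which is consistent with the lower skewness values the paper reports for fo's expressed in semitones.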
Determination of size of samples necessary for characterizing voice fundamental frequency in oral reading in general, or more broadly, in speaking voices, requires further investigation of various factors that influence individuals' usage of voice. In particular, variability of fo characteristics in repeated readings of the same text, effects of different types of texts, day-to-day variations, and the general relation of fo characteristics between impromptu or conversational speech and oral reading should be taken into consideration (Snidecor, 1943; Mysak, 1958; Cooper and Yanagihara, 1971). A variety of fo analysis techniques that use high-speed digital computers (Gold, 1962; Noll, 1964; Hollien, 1966; Miller, 1968; Atal and Hanauer, 1971), including the one used in the present study, are most suited for such studies that inevitably involve processing of large speech samples collected from large numbers of normal and pathological subjects.

ACKNOWLEDGMENT

William J. Ryan provided some of the speech recordings used in this report. His contributions are acknowledged with gratitude. This research was supported, in part, by the National Institute of Dental Research Grant 1-R01-DE-02815-03, and, in part, by the Air Force Cambridge Research Laboratories, Office of Aerospace Research under Contract F 19628-72-C-0154.


REFERENCES

ATAL, B., and HANAUER, S., Speech analysis and synthesis by linear prediction of the speech wave. J. acoust. Soc. Amer., 50, 637-655 (1971).
COOPER, M., and YANAGIHARA, N., A study of the basal pitch level variations found in the normal speaking voices of males and females. J. commun. Dis., 3, 261-266 (1971).
COWAN, M., Pitch and intensity characteristics of stage speech. Arch. Speech Suppl. I, 1-92 (1936).
FAIRBANKS, G., Voice and Articulation Drillbook. New York: Harper (1960).
GOLD, B., Computer program for pitch extraction. J. acoust. Soc. Amer., 34, 916-921 (1962).
HOLLIEN, H., Fundamental frequency indicator. Paper presented at the Annual Convention of the American Speech and Hearing Association, Chicago (November 1966).
MIKHEEV, Y., Statistical distribution of the periods of the fundamental tone of Russian speech. Soviet Physics-Acoustics, 16, 474-477 (1971).
MILLER, R., Performance characteristics of an experimental harmonic identification pitch extraction (HIPEX) system. J. acoust. Soc. Amer., 47, 1593-1601 (1968).
MYSAK, E., Gerontological processes in speech: Pitch and duration characteristics. Doctoral dissertation, Purdue Univ. (1958).
NOLL, M., Short-time spectrum and cepstrum techniques for vocal-pitch detection. J. acoust. Soc. Amer., 36, 296-302 (1964).
SNIDECOR, J., A comparative study of the pitch and duration characteristics of impromptu speaking and oral reading. Speech Monogr., 10, 50-56 (1943).
ZEMLIN, W., Speech and Hearing Science. Englewood Cliffs, N.J.: Prentice-Hall (1968).

Received March 1, 1974. Accepted July 14, 1974.
