Hearing "words" without words: Prosodic cues for word perception

Lloyd H. Nakatani

Bell Laboratories, Murray Hill, New Jersey 07974

Judith A. Schaffer

Douglass College, New Brunswick, New Jersey 08903

(Received 15 March 1977; revised 18 August 1977)

We found that the stress pattern and rhythm of speech were prosodic cues for word perception in trisyllabic adjective-noun phrases. When listeners heard a "mamama" nonsense phrase that mimicked an actual English adjective-noun sequence such as "new result," they correctly parsed (divided) the phrase into the words "ma" and "mama" corresponding to "new" and "result." Listeners must have used prosodic features to parse the phrases because the speech had the stress pattern, rhythm, and pitch prosody of normal speech, but had none of the usual meaning and sound cues for words. The stress pattern was a prosodic cue for word perception because the phrases were easy or hard to parse depending on their stress pattern. And rhythm was also a prosodic cue for word perception because the parsing of a phrase was changed when its rhythm was changed via synthetic speech. But the pitch and amplitude contours were not prosodic cues for word perception because the parsing of a phrase remained unchanged when its pitch and amplitude contours were changed via synthetic speech. The results suggest that, in the rule synthesis of speech, the rules for stress and rhythm must be carefully formulated because they affect not only the naturalness of the speech, but also the ease with which it can be understood.

PACS numbers: 43.70.Dn, 43.70.Gr, 43.70.Ve

INTRODUCTION

To communicate a thought, we put words together end-to-end to form a sentence which expresses that thought. When the sentence is conceived, its constituent words are as distinct as beads on a string. But when the sentence is realized as speech, the words get smeared together so that the junctures or boundaries between words become indistinct. The lips, jaws, and tongue blend the sounds of the words together as we say one word while anticipating the next. So when we listen to speech, we are faced with the formidable task of breaking up a stream of sound into the words that the speaker intended. How do we as listeners do this? Fortunately, the speaker--as if to compensate for the smearing--puts certain signals into the speech that enable us to recover the words. For example, the /t/'s at the beginning of words have a burst of noise that is usually missing from /t/'s that occur either in the middle or at the end of words. But the entire burden of assuring good comprehension does not rest solely on the speaker. As listeners, we apply what we know about the way English is spoken to help us hear the words.

We report here some new findings about how the nature of English constrains a person to speak in a certain manner, and how the listener's knowledge of these constraints enables him to infer where one word ends and the next begins. It is well known that each sentence is spoken with a certain stress pattern and rhythm that depend, among other things, on the words that make up the sentence. What we have found is that listeners know how the stress pattern and rhythm constrain the possible parsing of a sentence into words, and that they use this knowledge to hear where the words begin and end. These findings came from four experiments that used a special type of speech described next.

a) This paper is an expanded version of a talk presented at the 92nd meeting of the Acoustical Society of America, San Diego, November 1976.

J. Acoust. Soc. Am. 63(1), Jan. 1978

I. REITERANT SPEECH

We expected the prosodic features of speech to be rather weak cues or signals for word perception relative to other known, powerful cues. So our strategy was to use a special type of speech which eliminated the powerful cues and allowed the effects of weaker cues on word perception to be observed.

By far the most powerful and obvious cue for word perception is the word itself. We hear words because some sound sequences have meaning for us while others do not. In short, we hear words because we know words. This would explain why a single English word heard amidst the jabber of an unknown foreign language often stands out like an oasis in a desert.

Constraints on the order and special features of speech sounds are another set of powerful cues for word perception. Certain sound sequences or clusters can occur only if separated by a syllable or word juncture. And some sounds have variants that occur only at the beginning, middle, or end of a word (like the /t/ mentioned earlier).

Both of these powerful cues for word perception were eliminated by using what we have called reiterant speech. This is speech obtained when the same syllable, namely /ma/, is substituted for every syllable of a meaningful sentence. For example, "Mary had a little lamb" may be spoken as "Mama mama mama ma." We should note here that any syllable could be used to get reiterant speech.
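The substitution that defines reiterant speech can be sketched in a few lines. This is only a text-level illustration, not the authors' procedure (the talkers produced the speech themselves): the function name `reiterate` is hypothetical, and the syllable count of each word is assumed to be supplied by hand.

```python
def reiterate(syllable_counts, syllable="ma"):
    """Return a reiterant rendering of a phrase.

    syllable_counts: number of syllables in each word, in order
    (hand-supplied; automatic syllabification is outside this sketch).
    syllable: the syllable substituted for every original syllable.
    """
    # Each word becomes the same syllable repeated once per original syllable.
    words = [syllable * count for count in syllable_counts]
    return " ".join(words)


# "new result" has 1 + 2 syllables; "remote stream" has 2 + 1.
print(reiterate([1, 2]))  # ma mama
print(reiterate([2, 1]))  # mama ma
```

The spaces in the output mark the word boundaries of the mimicked phrase. In the spoken stimulus those boundaries are not given, and the listener has to recover them from prosody alone.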

0001-4966/78/6301-0234$00.80 © 1978 Acoustical Society of America

Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 158.42.28.33 On: Mon, 01 Dec 2014 11:22:09


TABLE I. Examples of phrases with different stress patterns in their sentence contexts.

Stress pattern    English phrase and sentence
11 1              The malformed nose was her claim to fame.
1 11              The beef potpie was bubbling when it came out of the oven.
12 1              The foolproof lock was putty in the hands of the burglar.
1 21              The long stampede left dust covering the town.
10 1              The noisy dog kept everyone up all night.
1 01              The bold design caught and kept everyone's attention.
01 1              The remote stream was perfect for fishing.
1 12              The bright campfire made the group feel quite secure.
1 10              The near future is not yet determined for her.

Reiterant speech was used by Carlson et al. (1973) to study the rhythm and pitch prosody of Swedish. They found that listeners could transcribe the stress pattern and also hear word boundaries reliably in reiterant speech. Their results anticipate the results of two of our experiments. Reiterant speech was brought to our attention by Mark Liberman who used it to study the rhythm prosody of English (Liberman and Streeter, 1978).

Just as a song sung with "la's" preserves the rhythm, tempo, pitch, and dynamics of the song, we expected the reiterant speech to preserve the prosody of the English sentence that was mimicked. The reiterant speech obviously eliminated the powerful meaning and sound cues for word perception. Evidence that the reiterant speech preserved the prosodic cues we wanted to study will be presented later when we describe the experiments.

Reiterant speech also had the important benefit of making regularities in the prosodic features more apparent. The prosodic features of speech are influenced not only by the words and syntax of a sentence, but also by the sounds themselves. Some sounds are intrinsically short or long, soft or loud, low or high pitched (see Lehiste, 1970). The variations that the sounds cause in rhythm, amplitude, and pitch add noise, in a statistical sense, to the perception and measurement of these prosodic features. By using the same /ma/ syllable everywhere in a sentence to eliminate variations in the speech sounds, the underlying regularities in the prosodic features of speech could be more readily observed.

Instead of entire sentences, the experiments used simple trisyllabic adjective-noun phrases that were reiterated in the context of sentences. Examples of the phrases and their sentence contexts are given in Table I. Short sentences (e.g., "The remote stream was perfect for fishing") were read with /ma/ substituted for every syllable of the adjective-noun phrase (e.g., "The mama ma was perfect for fishing"). Each sentence had the following constant syntax: article + adjective + noun + verb phrase; this assured that the prosody of the phrases was determined primarily by the stress pattern and syllable count of the words with no confounding influence due to variations in syntax.

Phrases with all possible stress patterns were recorded by seven talkers (four female and three male). The talkers were instructed not to reduce the vowel on syllables with no stress. The reiterant phrases were excised from the sentences with a computer-aided signal handling system (Nakatani, 1977) and used as stimuli in the experiments that follow.

II. STRESS TRANSCRIPTION

The prosody of a reiterant adjective-noun phrase was determined primarily by the stress pattern of the words mimicked. Yet we could not assume that the actual stress pattern realized in a phrase was as transcribed in a dictionary because of dialectal and idiolectal differences, and because of uncertainty about how well talkers could produce reiterant speech. So for the first experiment, we asked phonetically trained listeners to transcribe the stress pattern of the phrases. On the basis of their transcriptions, we selected phrases for use in subsequent experiments.

A. Method

We had listeners transcribe the stress for only the disyllabic adjectives and nouns in the reiterant phrases because we assumed that the monosyllabic adjectives and nouns had primary stress by default. None of the reiterant phrases had an adjective-noun sequence that formed a compound noun (e.g., "tennis match") because the noun in such a compound has secondary rather than primary stress.

On each trial of the transcription experiment, the listeners heard a phrase repeated three times at 1-s intervals; a brief warning tone preceded the first phrase. The listeners could begin their stress transcriptions as soon as they heard the first phrase.

The trials were organized into adjective blocks and noun blocks. In an adjective block, only phrases with a disyllabic adjective and monosyllabic noun were played. The listeners were told to transcribe the stress on only the first two syllables of each phrase; these syllables corresponded to the mimicked adjective. An equivalent procedure was followed in the noun blocks.

As a preliminary to each block, all the phrases occurring in the block were played once each at 1/•-s intervals to familiarize the listeners with the talker. A different random order was used for the familiarization trials and the block of transcription trials that followed immediately.

The listeners heard the adjective block and then the noun block of one talker before going on to the next talker. An adjective block consisted of 17 phrases and a noun block consisted of 16 phrases, for a total over seven talkers of 231 phrases.
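As a quick check on the stimulus count, the total can be recomputed from the block sizes stated above. The function name `transcription_phrases` is hypothetical and used only for this sketch.

```python
def transcription_phrases(talkers, adjective_block, noun_block):
    """Total phrases transcribed: one adjective block and one noun block per talker."""
    return talkers * (adjective_block + noun_block)


# Values stated in the text: 7 talkers, 17-phrase adjective blocks, 16-phrase noun blocks.
total = transcription_phrases(talkers=7, adjective_block=17, noun_block=16)
print(total)  # 231
```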


Ten listeners who were familiar with the notion of stress served as subjects. Their experience ranged from undergraduates who had taken an introductory phonetics course, to a linguist whose dissertation dealt with stress. The listeners were told that "... at least one of the two syllables must have primary stress. The other may have primary, secondary, or no stress." They were also told "... not to listen for vowel reduction as an indication of no stress."

B. Results

Based on the transcription data, we selected nine phrases for each talker with adjectives and nouns that best exemplified the stress patterns shown in Fig. 1. Primary, secondary, and null stress levels are denoted by the numbers 1, 2, and 0, respectively, in this paper. Each box of Fig. 1 shows the distributions of stress transcriptions for syllables 1 and 2 of the disyllabic adjectives and nouns of these selected phrases.

FIG. 1. Distribution of stress transcriptions for each syllable of disyllabic adjectives and nouns in reiterant speech. Each adjective and noun was spoken by seven talkers and transcribed by ten listeners, so each distribution represents 70 transcriptions. The numbers 1, 2, and 0 denote primary, secondary, and null stress, respectively. The number pairs 11, 12, etc., denote the stress pattern common to an adjective and a noun. [Panels: adjective and noun boxes for each stress pattern. Horizontal axis: stress levels per syllable (syllable 1, syllable 2). Vertical axis: %.]

No disyllabic adjective was transcribed with a 21 stress pattern, as indicated by the empty box in Fig. 1. The adjectives "champagne," "unsafe," "antique," and "unkempt" were candidates for a 21 stress pattern according to their dictionary markings, but neither they nor any other adjective was transcribed as such. This finding is consistent with the stress-shift rule in English which causes the most prominent stress in an adjective to shift to an earlier stressed syllable in the adjective when the prominent stress normally falls on the final syllable of the adjective and the following noun begins with a stressed syllable. This rule had two effects. One was to preclude adjectives with a 21 stress pattern. Another was that for adjectives with an 11 stress pattern, the stress on the first syllable was the more prominent, as indicated by the diffuse distribution of stress transcriptions for the second syllable.

For nouns with an 11 stress pattern, the reverse was observed. The diffuse distribution of stress transcriptions for the first syllable suggests that its primary stress was slightly diminished relative to the primary stress on the second syllable. This reversal is consistent with the hypothesis that English is a stress-timed language with a tendency for primary stress to occur on alternate rather than successive syllables (see Allen, 1972, for a brief review of stress-timed rhythm). It is clear that the tendency for alternation, if present, is weak. Otherwise, the primary stress on nouns with a 12 stress pattern would have shifted to the second syllable, and the 12 noun box would have been empty in Fig. 1.

It was evidently easy to pick out the syllable with the dominant stress. In eight out of nine cases (exception: 11 noun), the syllable with the dominant stress was transcribed with primary stress in more than 65 out of 70 transcriptions. This suggests that our transcription data were fairly reliable. Carlson et al. (1973) also found that two levels of stress were reliably transcribed in reiterant Swedish sentences with up to eight syllables.

As mentioned earlier, the stress transcriptions were obtained, not because of our interest in stress perception per se, but rather to provide a rational and objective basis for selecting stimuli for the word perception experiments. Figure 1 shows that we had reasonable exemplars of nine stress patterns (the 11 noun was marginal but nevertheless included). So 63 phrases--nine phrases best exemplifying the nine stress patterns for each of seven talkers--were selected for use as stimuli in the remaining three experiments on word perception.

III. PARSING OF NATURAL SPEECH UTTERANCES

If listeners could tell whether the adjective was disyllabic and the noun monosyllabic, or vice versa, in the reiterant phrases, then we could infer two things: In speech production, some information about the words that make up an utterance must be encoded in the prosody of the utterance. And in speech perception, listeners must be able to decode this information in the prosody to help recover the words of the utterance. Of course, the actual English words mimicked by a reiterant phrase can't be recovered, but the prosody of the phrase may enable listeners to hear where the word juncture occurred, and thus enable them to parse the phrase into two pieces corresponding to the words mimicked. We found that listeners could indeed parse reiterant phrases into "words," and that the stress pattern of the phrases determined how easily they were parsed.

A. Method

In the parsing experiment, reiterant phrases were played to listeners who judged whether they heard each phrase as "ma mama" or "mama ma." The experiment consisted of two test blocks of 63 trials each. In each block, one of the 63 reiterant phrases selected on the basis of the stress transcription experiment was heard on each trial. The phrases were presented in different random orders in the two blocks.

Each trial began with a brief warning tone followed by three presentations of a given test phrase at 1-s intervals. The listeners were asked to decide whether the reiterant phrase mimicked a disyllabic adjective + monosyllabic noun, or a monosyllabic adjective + disyllabic noun. On the answer sheets, which had three "ma" syllables separated by spaces for each trial, the listeners indicated their parsing decision by putting a slash after the first (ma/ma ma) or second syllable (ma ma/ma). They were free to make their decision after the first presentation of the phrase. The listeners were told how the reiterant phrases were produced, and the constant syntactic context of the sentences from which the phrases were excised.

We gave two groups of subjects different types of practice trials. The first group (38 subjects) got 63 practice trials with English feedback so that the subjects could learn the task quickly. In the practice trials, the actual English words mimicked by the reiterant phrase occurred shortly after the third presentation of the phrase to indicate the correct parsing of the phrase. In other respects, the practice trials were like the test trials. The practice phrases were equivalent to, but not the same as, the phrases used on the test trials.

Because of the possibility that the practice trials with English feedback might have trained subjects to perform an inherently nonlinguistic task, we ran a second group (45 subjects) without English feedback. This group got only 18 (instead of 63) practice trials. For each of the nine stress patterns, two exemplary phrases that were the easiest for the first group to parse were selected for the practice trials. The practice trials had exactly the same format as the test trials; the subjects were not told that the first 18 trials of the experiment were for practice. Subjects in both groups were high school students who were paid for their participation.

FIG. 3. Parsing scores for natural speech phrases by listeners who got practice trials like the actual test trials without English words to indicate the correct parsing. The means and 99% confidence intervals (CI) were for 45 listeners who heard each phrase spoken twice by seven talkers. [Axis: mean percent correctly parsed, with 99% C.I.; phrases grouped as ambiguous and unambiguous.]

B. Results

Figures 2 and 3 show how well the reiterant phrases were parsed by the two groups. Both median and mean scores were computed, and both gave qualitatively the same results. As will be seen shortly, the distributions of the scores for some of the phrases were bimodal and even trimodal. For such distributions, the median scores would have been more appropriate, yet mean scores are shown in Figs. 2 and 3 so that the variability of the scores can also be shown. For both groups, all the phrases were correctly parsed significantly better than chance (50%), but the stress pattern made some phrases significantly easier to parse than others.

FIG. 2. Parsing scores for natural speech phrases by listeners who got practice trials with actual English words following the reiterant phrases to indicate the correct parsing. The means and 99% confidence intervals (CI) were for 38 listeners who heard each phrase spoken twice by seven talkers. [Axis: mean percent correctly parsed, with 99% C.I.; phrases grouped as ambiguous and unambiguous.]

We divided the phrases into two sets according to how their stress pattern constrained their parsing. The three phrase pairs in the upper box of Figs. 2 and 3 are ambiguous because their parsing is ambiguous given only the stress pattern. For example, the 10 1 ("tasty food") and 1 01 ("bold design") phrase pair have the same stress pattern but different parsings. The phrases in the lower box of Figs. 2 and 3 are unambiguous because their parsing is uniquely determined by their stress pattern. These phrases have secondary or null stress on either the first or last syllable; such a syllable could only be part of a disyllabic word since all monosyllabic words had primary stress.
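The constraint just described can be written as a small rule: a syllable can stand alone as a monosyllabic word only if it carries primary stress. The sketch below applies that single rule to a three-syllable stress sequence. The function names are hypothetical, and no other restriction on disyllabic words is modeled.

```python
def possible_parsings(stress):
    """List the parsings of a trisyllabic phrase allowed by its stress pattern.

    stress: three characters from '1' (primary), '2' (secondary), '0' (null),
    one per syllable. The only rule applied is that a monosyllabic word
    must carry primary stress.
    """
    if len(stress) != 3 or any(level not in "120" for level in stress):
        raise ValueError("stress must be three characters drawn from 1, 2, 0")
    parsings = []
    # "ma mama": the first syllable is a monosyllabic adjective.
    if stress[0] == "1":
        parsings.append(f"{stress[0]} {stress[1:]}")
    # "mama ma": the last syllable is a monosyllabic noun.
    if stress[2] == "1":
        parsings.append(f"{stress[:2]} {stress[2]}")
    return parsings


def is_ambiguous(stress):
    """True when the stress pattern alone leaves both parsings open."""
    return len(possible_parsings(stress)) > 1


for pattern in ("111", "121", "101", "011", "112", "110"):
    print(pattern, possible_parsings(pattern), is_ambiguous(pattern))
```

Under this rule the sequences 111, 121, and 101 admit both parsings, while 011, 112, and 110 admit only one, which matches the division of the phrases into ambiguous and unambiguous sets.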

In general, the unambiguous phrases were significantly easier to parse than the ambiguous phrases.³ It is apparent that listeners knew how the stress pattern of the unambiguous phrases constrained their parsing and used this knowledge to parse these phrases easily.

The stress pattern was not a sufficient cue for parsing the ambiguous phrases, yet the listeners parsed even these phrases significantly better than chance. There must have been some prosodic feature that differentiated these ambiguous phrase pairs and enabled listeners to parse them correctly. The experiment using synthetic speech described in the next section showed what these prosodic cues were.

Our results are consistent with those of Carlson et al. (1973) who found that a selected subset of their phrases were parsed reliably. They did not report whether the stress pattern influenced how easily the phrases were parsed.

We inferred that the listeners parsed the unambiguous phrases so easily because they knew some rules about how English is spoken. We suspected, however, that not every listener knew or applied the relevant rules. Listeners who knew the rules should have parsed the unambiguous phrases almost perfectly, while listeners who did not should have parsed these phrases no better than the ambiguous phrases. If there were two types of listeners, then the distributions of the parsing scores for the unambiguous phrases shown in the bottom three boxes of Figs. 4 and 5 should be bimodal. Our expectation was confirmed dramatically for some of the unambiguous phrases but only marginally for others. The results were not consistent for the experimental groups with and without feedback: The distributions that were strongly bimodal for one group were only weakly so for the other group. It's hard to see how these differences could be explained by the difference in feedback during the practice trials. We did not pursue the question of how listeners in the two groups differed. If differences in people's ability to parse reiterant phrases reflect some differences in their linguistic competence, then the reiterant phrase parsing task may prove useful as a diagnostic tool for studying how well a person's language faculty is developed.

FIG. 4. Distributions of parsing scores for six ambiguous phrases (upper six boxes) and three unambiguous phrases (lower three boxes) by 38 listeners who got practice trials with English feedback to indicate correct parsing. The highest possible score was 14 because the listeners heard each phrase spoken twice by seven talkers. The few scores that fell below 6 are not shown. [Horizontal axis: number correctly parsed out of 14 possible.]

FIG. 5. Distributions of parsing scores for six ambiguous phrases (upper six boxes) and three unambiguous phrases (lower three boxes) by 45 listeners who got practice trials without English feedback (i.e., no indication of correct parsing). The highest possible score was 14 because the listeners heard each phrase spoken twice by seven talkers. The few scores that fell below 6 are not shown. [Horizontal axis: number correctly parsed out of 14 possible.]

We summarize the outcome of the parsing experiment as follows:

(1) Most, but not all, listeners used stress pattern as a prosodic cue for word perception.

(2) When the stress pattern was not a sufficient cue for word perception, listeners must have used other prosodic cues for word perception because the ambiguous phrases were parsed better than chance.

IV. PARSING PHRASES OF SYNTHETIC SPEECH

The reiterant phrases with ambiguous stress patterns were hard to parse, but they were nevertheless all parsed better than chance. Because the stress pattern wasn't a sufficient cue for parsing these phrases, there must have been other prosodic features which differentiated phrases with the same stress pattern but different parsings. The parsing experiment using synthetic speech hybrid phrases was designed to find out what these features were, and how strong they were as cues for word perception. Using a technique called hybrid speech synthesis, we found that rhythm was a prosodic cue for parsing the ambiguous phrases, and that pitch and amplitude were not prosodic cues for parsing.

A. Hybrid speech synthesis

Hybrid speech synthesis (Olive and Nakatani, 1974) is a general experimental technique for assessing the strength of, and interactions among, a set of speech features in influencing some speech perception task. We describe the technique as it applies to the problem at hand--namely, the study of prosodic cues for word perception.

Suppose that we have two phrases A and B with the same stress pattern but different parsings, and that A and B sound different because of differences in their prosodic features that reflect their parsing. We can create a new phrase X by adjusting the durations of the segments of A to match the durations of the corresponding segments of B. X is a hybrid phrase in the sense that some of its prosodic features (everything but rhythm) come from the parent phrase A and some (only rhythm) from the parent phrase B. To assess the strength of rhythm as a cue for word perception, we see if listeners parse X like A or B. How often X is parsed like A or B is a measure of how weak or strong, respectively, rhythm is as a cue for word perception. Of course, not only rhythm but other prosodic features could be studied in the same way.

We can study interactions among features by creating hybrid phrases with all possible combinations of prosodic features from the two parent phrases. Two or more features interact if their combination is a weaker or stronger cue than one would expect from their strength as cues when they occur alone or in combination with other features.

The novel aspect of hybrid speech synthesis is not the factorial design, but the assignment of "100% natural" values to the features. Instead of using some model or ad hoc procedure to assign values--possibly unspeechlike--to the features, hybrid speech synthesis assures that, by taking features from the parents, only values that occur in natural speech are assigned to the features. This presumes, of course, that the parents have natural features. This in turn is assured if the features of the parents are extracted from natural speech by some procedure such as linear predictor analysis.

The advantage of using hybrid speech synthesis to assign values to features becomes more apparent as the features become more complex. With simple features (e.g., voice onset time), it is usually easy to assign values to the features within the range of values found in natural speech. But with complex features (e.g., pitch contour) which require not just a single value but an entire table of values for each feature, the assignment of natural values to the features becomes problematical. In short, hybrid speech synthesis enables us to study complex speech features without specifying beforehand models to assign reasonable values to the features. By circumventing models, we obtain results free of contamination due to defects in the models. With hybrid speech synthesis, we can observe first, then model.

FIG. 6. Schematic diagram of the hybrid speech synthesis technique for generating hybrid phrases having a combination of features taken from two parent utterances. The hybrid phrases were synthesized from parameters representing the natural features of the parent phrases; the parameters were derived directly from natural speech by linear predictor analysis. In this example, the hybrid phrase has the amplitude and pitch features of the upper parent, and the spectrum and rhythm features of the lower parent. [Diagram labels: parent and hybrid panels, each with amplitude, pitch, spectrum, and rhythm.]

B. Method

As shown in Fig. 6, hybrid speech synthesis was used to study the prosodic features of rhythm, pitch, amplitude, and spectrum as cues for parsing the ambiguous phrases. A pair of ambiguous phrases with the same stress pattern but different parsings (e.g., 12 1 and 1 21) served as the two parent phrases. Hybrids were generated by taking the rhythm, pitch, amplitude, and spectrum from either parent phrase in all possible combinations. The hybrid in Fig. 6 has the amplitude and pitch of the upper parent, and the spectrum and rhythm of the lower parent. Since each of the four features could come from either parent, each set of parents generated a family of 16 (= 2 × 2 × 2 × 2) offspring phrases. Of the 16, 14 were true hybrids, and two were exact copies of the parents.
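The factorial family described above can be enumerated directly. The following is a minimal sketch of the combinatorics only, with hypothetical names (`offspring_family`, `is_true_hybrid`) and the parents labeled "A" and "B" for illustration; it does not synthesize any speech.

```python
from itertools import product

# The four features manipulated in the experiment, as named in the text.
FEATURES = ("rhythm", "pitch", "amplitude", "spectrum")


def offspring_family(parents=("A", "B")):
    """Enumerate every assignment of each feature to one of the two parents."""
    return [
        dict(zip(FEATURES, combo))
        for combo in product(parents, repeat=len(FEATURES))
    ]


def is_true_hybrid(offspring):
    """A true hybrid draws features from both parents; a copy draws from one."""
    return len(set(offspring.values())) > 1


family = offspring_family()
hybrids = [o for o in family if is_true_hybrid(o)]
print(len(family), len(hybrids))  # 16 14
```

The two offspring excluded by `is_true_hybrid` are the all-"A" and all-"B" assignments, which correspond to the exact copies of the parents.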

Hybrid speech synthesis was done via linear predictor analysis and synthesis (Atal and Hanauer, 1971) so that the prosodic features of speech could be accessed and manipulated independently. The linear predictor parameters consisted of pitch and amplitude parameters and 12 pseudoarea parameters representing the spectral feature. The rhythm feature was represented by frames at 10-ms intervals that specified values for the other 14 parameters. The parameters representing the natural features of the parent phrases were obtained directly from natural speech by linear predictor analysis of the ambiguous phrases of the first parsing experiment. An interactive program enabled us to take features from either parent, combine them to create hybrid parameter sets, and synthesize hybrid phrases from the parameter sets by linear predictor synthesis. The program also enabled us to synchronize the prosodic features taken from parents with different rhythm. As indicated in Fig. 6, the portions of each prosodic feature corresponding to the /m/ and /a/ segments were linearly stretched or compressed when necessary to synchronize the rhythm of the feature with the rhythm of the hybrid being generated.
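The segment-by-segment linear stretching can be illustrated with a short sketch. It is not the authors' interactive program, whose details are not given here. It assumes each feature is stored as a list of per-frame values already divided into /m/ and /a/ segments, and the names `resample_linear` and `impose_rhythm` are hypothetical.

```python
def resample_linear(track, new_len):
    """Linearly stretch or compress a per-frame parameter track to new_len frames."""
    if new_len <= 0 or not track:
        raise ValueError("need a non-empty track and a positive target length")
    if new_len == 1 or len(track) == 1:
        return [float(track[0])] * new_len
    # Map each output frame onto a fractional position in the input track.
    scale = (len(track) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[hi] * frac)
    return out


def impose_rhythm(segments, target_frames):
    """Warp each segment of a donor feature to the frame count of the hybrid's rhythm.

    segments: per-segment tracks (e.g., pitch values for /m/, /a/, /m/, ...).
    target_frames: frames per segment in the parent that supplies the rhythm.
    """
    if len(segments) != len(target_frames):
        raise ValueError("segment lists must align one-to-one")
    return [resample_linear(seg, n) for seg, n in zip(segments, target_frames)]


# Made-up pitch values for two segments, warped to 3 and 2 frames.
pitch = [[100.0, 120.0], [120.0, 110.0, 100.0]]
print(impose_rhythm(pitch, [3, 2]))  # [[100.0, 110.0, 120.0], [120.0, 100.0]]
```

With frames at 10-ms intervals, a segment's duration is its frame count times 10 ms, so matching frame counts segment by segment matches the rhythm of the donor feature to that of the hybrid.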

The hybrid phrases were generated by the following design. The ambiguous phrases had three stress patterns: 111, 121, and 101. For each stress pattern, two talkers were selected whose phrases had the best parsing scores in the first parsing experiment, subject to the constraint that six talkers were represented (see footnote 4). For each talker, the pair of ambiguous phrases with the appropriate stress pattern were converted by linear predictor analysis from natural speech to parametric representations of the parent phrases for hybrid speech synthesis. Since there were six sets of parents (a set for each of two talkers for each of three stress patterns), and each set generated 16 hybrid phrases as offsprings, a total of 96 (= 6 x 16) hybrid phrases were generated.

Because the ambiguous phrases were hard to parse, a paired-comparison paradigm was used for the hybrid phrases to make parsing easier for the listeners. Each trial of the experiment consisted of the paired presentation of an anchor phrase and a comparison phrase. The anchor phrase was always one of the parent phrases, and the comparison phrase was always one of the offsprings of the parent phrase that served as the anchor for that pair. Half a second separated the parent and offspring phrases, and a 3-s response interval followed the pair. Since either parent could serve as an anchor, two stimulus tapes were made: one with the "ma mama" parent, and the other with the "mama ma" parent as the anchor phrase on each trial. Each parent-offspring pair was played in both orders so that each tape consisted of 192 (= 96 x 2) test trials plus 30 practice trials before the test trials. The practice trials consisted of five pairs from each talker that spanned the range of anchor and comparison pairs from those where the phrases were very similar, to those where they were very distinct.

Two groups of subjects were run for each tape with different parsing instructions. The groups were told to judge which member of each pair sounded like "ma mama" (one group) or "mama ma" (the other group). They were all told how the reiterant phrases were produced, and told of the constant syntax of the sentence context from which they were excised.

The subjects were 44 high school students, 11 in each of the four groups, who were paid for their participation.

In summary, the experiment was a product of five speech variables (stress pattern, rhythm, pitch, amplitude, and spectrum) and four control variables (talkers, presentation order, anchor phrase, and parsing instructions). There were three stress patterns: 111, 121, and 101. For each stress pattern, we had two sets of parents from two different talkers. Each set of parents generated a family of 16 hybrid phrases with all possible combinations of rhythm, pitch, amplitude, and spectrum. This gave us 96 hybrid phrases in six families with 16 offsprings per family. Each offspring phrase was paired with its "ma mama" parent (one stimulus tape) or its "mama ma" parent (another stimulus tape); the parent was the anchor phrase on each trial. Each parent-offspring pair was presented in both orders, so each tape had 192 test trials preceded by 30 practice trials. Each tape was played to two groups with either "ma mama" or "mama ma" parsing instructions. Each of the four groups had 11 subjects.

C. Results

The results of the experiment are shown in Fig. 7. The mean percent difference in parsing was computed as follows. Each family of 16 offspring phrases was divided into eight minimal pairs of hybrid phrases that differed only in the values of one feature (rhythm, for example). One hybrid phrase (call it A) of each pair had the rhythm of the anchor parent, and the other hybrid phrase (call it B) had the rhythm of the other parent. If rhythm was a strong cue for parsing, then A should have sounded like the anchor parent, and B should have sounded like the nonanchor parent. This means that the parsing of A would not have been discriminated from that of the anchor phrase, and A would have a parsing discriminability score not much better than chance (50%). By contrast, the parsing of B would have been easily discriminated from that of the anchor, and B would have a high parsing discriminability score. The difference between the high score for B and the near-chance score for A was taken as a measure of the strength of rhythm as a cue for parsing the ambiguous phrases. The mean of these difference scores averaged over the eight minimal pairs within a family, over the two families per stress pattern, over two presentation orders, and over 11 subjects is plotted in Fig. 7 for each stress pattern and group. Similar mean difference in parsing scores computed for the pitch, amplitude, and spectrum features are also plotted in Fig. 7. It is obvious from Fig. 7 that rhythm was the only prosodic feature that influenced the parsing of the hybrid phrases. An analysis of variance of the discriminability scores showed that none of the prosodic features interacted significantly.

FIG. 7. Parsing difference scores for minimal pairs of hybrid phrases that differed only in one feature: either rhythm, pitch, amplitude, or spectrum. The stress patterns are those that characterized the ambiguous phrases. Each four-bar cluster represents the results of the four groups with different anchors and parsing instructions. The means and 99% confidence intervals (CI) are for 11 listeners.
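The minimal-pair difference score described above can be sketched for a single family as follows. This is an illustrative sketch only: the data layout (a mapping from feature sources to percent discriminability), the function name, and the scores in the usage example are assumptions, and the numbers are invented rather than taken from the experiment.

```python
from itertools import product

FEATURES = ("rhythm", "pitch", "amplitude", "spectrum")
SOURCES = ("anchor", "other")  # which parent a feature was taken from


def mean_difference(scores, feature):
    """Mean of score(B) - score(A) over the eight minimal pairs for one feature.

    scores maps a 4-tuple of sources (in FEATURES order) to a percent
    discriminability score. In each pair, A has the anchor's value of the
    feature, B has the other parent's value, and the rest are identical.
    """
    index = FEATURES.index(feature)
    differences = []
    for rest in product(SOURCES, repeat=len(FEATURES) - 1):
        a = rest[:index] + ("anchor",) + rest[index:]
        b = rest[:index] + ("other",) + rest[index:]
        differences.append(scores[b] - scores[a])
    return sum(differences) / len(differences)


# Invented scores: chance (50%) unless the rhythm comes from the other parent.
invented = {
    combo: 50.0 + (40.0 if combo[0] == "other" else 0.0)
    for combo in product(SOURCES, repeat=len(FEATURES))
}
print(mean_difference(invented, "rhythm"))  # 40.0
print(mean_difference(invented, "pitch"))   # 0.0
```

In the experiment these per-family means were further averaged over families, presentation orders, and subjects; that outer averaging is omitted here.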

The pitch and amplitude features did not influence the parsing of the hybrid phrases. This result was expected from examinations of the pitch and amplitude contours of the parent phrases. Neither feature showed much difference in contour between the two parents for any of the stress patterns or talkers. The stress pattern and parsing of a phrase apparently did not introduce perceptible perturbations in the pitch contour. The global shape of the pitch contour, which depends primarily on syntax, was relatively constant for the phrases because their syntax was kept constant. We can be fairly confident that the results for pitch and the other prosodic features were not confounded by uncontrolled variations in syntax.

The fact that spectrum had no influence on parsing is clear from Fig. 7, but we can't generalize this result to normal speech because of the nature of reiterant speech. Experiments that have used normal speech instead of reiterant speech have demonstrated amply that spectral features are very strong cues for word perception (Lehiste, 1960; Lehiste, 1964; O'Connor and Tooley, 1964). As mentioned at the outset, reiterant speech was used because we thought it would be free of word-dependent variations in certain speech sounds which are known to be strong cues for word perception. Also, vowel reduction was minimal in the reiterant speech because we instructed our talkers to avoid reducing unstressed vowels. Besides, vowel reduction is a cue for stress perception which was of no help in parsing the ambiguous phrases. Any residual vowel reduction that may have been present in the reiterant phrases could not have influenced the parsing of the hybrid phrases. The lack of influence of spectrum on word perception is a positive rather than negative result because it confirmed our expectation that reiterant speech was free of word-dependent variations in the speech sounds. This interpretation must be taken with some caution since there is a slight tendency for the spectrum to influence the parsing of phrases with a 111 stress pattern.

The influence of the anchor and instruction control variables was restricted to what seems a strange deafness to rhythm that afflicted only the group with the "ma mama" anchor and "ma mama" instruction (the rightmost bar in the four-bar clusters). We can't explain why this particular combination should have caused the group any special difficulty.

In summary, it is clear that rhythm was the only prosodic cue for word perception in the ambiguous reiterant phrases. Pitch and amplitude were not prosodic cues for word perception in the reiterant phrases. We are fairly confident that these results will generalize to normal speech when such tests are made. The finding that spectrum was not a cue for word perception was specific to reiterant speech, and should not be generalized to other types of speech.

V. MEASUREMENT OF RHYTHM

We found that rhythm was a strong cue for word perception. But we still had to find out what specific aspect of rhythm differentiated two ambiguous phrases with the same stress pattern but different parsings. We measured the durations of the syllables of reiterant phrases and found that phrases with a long first syllable were heard to begin with a monosyllabic word. The measurements also revealed the effects of stress levels and patterns on syllable durations and, hence, on speech rhythm.

A. Method

The durations of the /m/ and /a/ phonemes were measured in the 63 reiterant phrases used in the first parsing experiment. In each syllable, the /m/ duration was the duration of the nasalized portion, and the /a/ duration was the duration of the vocalic portion. Since these portions were easily visible in the waveforms of the phrases, the duration measurements were made on the waveform display of a computer-aided signal handling program (Nakatani, 1977).

B. Results

FIG. 8. Normalized syllable durations of the /ma/ syllables of reiterant speech for phrases with different stress patterns. The /m/ and /a/ durations are shown separately for each syllable, but the number in the bar is the total duration of the syllable. The means and standard error (SE) intervals are for seven talkers. Horizontal axis: normalized syllable duration (msec) with S.E. interval.

The duration measurements were normalized so that every talker had the same speaking rate. The average length L of all 63 phrases was computed, and the lengths of the nine phrases for each talker were linearly scaled to make their average length equal to L. The normalized syllable durations are plotted in Fig. 8. The top six lines of bars in Fig. 8 show the duration measurements for the three pairs of ambiguous phrases.
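The speaking-rate normalization described above amounts to one scale factor per talker. The sketch below assumes a hypothetical data layout (a mapping from talker to a list of phrases, each phrase a list of segment or syllable durations in ms); the function name and the toy durations are invented for illustration.

```python
def normalize_durations(phrases_by_talker):
    """Scale each talker's durations so that the talker's mean phrase length
    equals the grand mean phrase length L over all phrases."""
    all_lengths = [
        sum(phrase)
        for phrases in phrases_by_talker.values()
        for phrase in phrases
    ]
    grand_mean = sum(all_lengths) / len(all_lengths)  # L in the text

    normalized = {}
    for talker, phrases in phrases_by_talker.items():
        talker_mean = sum(sum(phrase) for phrase in phrases) / len(phrases)
        factor = grand_mean / talker_mean  # one linear scale factor per talker
        normalized[talker] = [[d * factor for d in phrase] for phrase in phrases]
    return normalized


# Toy data: a fast and a slow talker, one trisyllabic phrase each (ms).
toy = {"fast": [[100.0, 100.0, 100.0]], "slow": [[200.0, 200.0, 200.0]]}
print(normalize_durations(toy))
# {'fast': [[150.0, 150.0, 150.0]], 'slow': [[150.0, 150.0, 150.0]]}
```

Because a single factor is applied to every duration of a talker, the relative durations within each phrase, which carry the rhythm, are left unchanged.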

Three previously observed word-dependent effects on speech rhythm were considered as possible cues for word perception. They are (1) monosyllabic word elongation, (2) word-final syllable elongation, and (3) word-initial consonant elongation.

1. Monosyllabic word elongation

Monosyllabic words were longer than equivalent syllables in polysyllabic words when spoken in carrier phrases (Lehiste, 1972; Oller, 1973; Klatt, 1973), but not in connected discourse (Umeda, 1975). Monosyllabic word elongation occurred in the ambiguous reiterant phrases, and was probably the primary cue for word perception in these phrases. In each ambiguous phrase pair with the same stress pattern, the monosyllabic adjective in the "ma mama" phrase was longer by about 50 ms, on the average, than the first syllable of the disyllabic adjective in the "mama ma" phrase. Listeners parsed the ambiguous phrases correctly because they apparently heard a monosyllabic word at the beginning of phrases with a long first syllable.

Monosyllabic word elongation was not observed in the final syllable of the ambiguous phrases. The length of the final syllable showed no systematic effect of the syllable being either a monosyllabic word or the second syllable of a disyllabic word. Perhaps the elongation was neutralized by prepausal elongation (Martin, 1970; Klatt, 1975) which made the final syllables noticeably longer than the first and second syllables. Prepausal elongation occurred because the major syntactic boundary dividing the noun and verb phrases in the mimicked sentences always fell just after the final syllable of the reiterant phrases. The length of the final syllable may be a cue for phrase boundary (Streeter, 1976), but it was not a cue for parsing the ambiguous phrases.

2. Word-final syllable elongation

The final syllables of words were longer than equivalent word-initial and word-medial syllables in some studies (Lehiste, 1972; Oller, 1973; Klatt, 1973), but again not for connected discourse (Umeda, 1975). The ambiguous phrases did not show word-final syllable elongation. For example, the second syllable, which is word-final in the 11 1 phrase and word-initial in the 1 11 phrase, was about the same length in both phrases. Over all three ambiguous phrase pairs, the difference averaged only 5 ms between word-final and word-initial syllables. Word-final syllable elongation was certainly not a cue for parsing the ambiguous phrases.

3. Word-initial consonant elongation

Word-initial consonants were longer than equivalent consonants in word-medial and word-final positions in several studies (Oller, 1973; Klatt, 1974; Umeda, 1975). This effect was noticeable in the /m/ consonant durations in the ambiguous phrases. Word-initial /m/'s were about 11 ms longer, on the average, than word-medial /m/'s in the second and third syllables. The elongation of the word-initial consonant, although very consistent, was only about a fifth of the elongation of monosyllabic words. Judging from the relative magnitudes of the two effects, monosyllabic word elongation must have been the primary cue for parsing the ambiguous phrases, and word-initial consonant elongation might have been a secondary cue.
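As a rough illustration of how the primary rhythm cue could drive a paired-comparison judgment, the sketch below picks the phrase with the longer first syllable as the "ma mama" candidate. The function name, the optional tie threshold, and the durations in the usage line are hypothetical and chosen only for illustration; this is a toy decision rule, not a listener model proposed in the study.

```python
def pick_ma_mama(first_syllable_a_ms, first_syllable_b_ms, min_difference_ms=0.0):
    """Return 'a' or 'b' for the phrase with the longer first syllable.

    Returns None when the two first syllables differ by no more than
    min_difference_ms, i.e., when this cue alone cannot decide.
    """
    difference = first_syllable_a_ms - first_syllable_b_ms
    if abs(difference) <= min_difference_ms:
        return None
    return "a" if difference > 0 else "b"


# Invented durations: phrase a has a first syllable 50 ms longer than phrase b.
print(pick_ma_mama(300.0, 250.0))  # a
```

A fuller sketch would also weigh the much smaller word-initial consonant elongation, but the relative strength of the two cues is left open in the text.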


4. Stressed syllable elongation

Because the stress pattern was such a strong cue for parsing the unambiguous phrases, we sought and found effects in the reiterant speech that were correlated with stress pattern. The /ma/ syllable lengthened about 50 ms as the stress level rose from 0 to 2, and lengthened by about 35 ms as the stress level rose from 2 to 1 (see footnote 5). Syllable elongation from stress level 0 to 1, which was 85 ms (= 50 ms + 35 ms) in reiterant speech, has been previously observed (see Klatt, 1976), but Klatt (1975) found only a small elongation from stress level 2 to 1.

From the stress-correlated syllable elongations, and from the failure to observe, by inspection, stress-correlated variations in the pitch and amplitude features, we conjecture that rhythm is a strong cue for stress perception in connected speech. If so, then rhythm may be a sufficient cue for word perception in all the reiterant phrases, both ambiguous and unambiguous.

In summary, monosyllabic word elongation was, in all likelihood, the primary cue for parsing the ambiguous phrases, and word-initial consonant elongation was probably a secondary cue. Further experiments are needed to find out how strong these cues are. If, as conjectured, rhythm is a strong cue for stress perception, it is possible that stress pattern can be subsumed under rhythm prosody, and rhythm regarded as the primary prosodic cue for word perception.

VI. DISCUSSION

The discussion is divided into two parts: The first part discusses the major experimental findings concerning prosodic cues for word perception, and the second part discusses some methodological considerations in the study of speech prosody.

A. Prosodic cues for word perception

The stress pattern and rhythm of speech are prosodic cues for word perception. Pitch and amplitude are not prosodic cues for word perception.

1. Stress

The stress pattern of speech is considered a prosodic cue for word perception because unambiguous phrases were parsed much better than ambiguous phrases. A phrase was labeled unambiguous or ambiguous depending on whether its stress pattern allowed only one or two parsings, respectively, of the phrase into words. Listeners apparently knew that certain stress patterns could occur only with phrases parsed in one way, and they used this knowledge to parse easily the unambiguous phrases. Listeners apparently divided into two groups: those who did, and those who did not, use stress pattern as a cue for parsing the unambiguous phrases. The bimodal distributions of the parsing scores for the unambiguous phrases were indications of this division.

Garding (1967) pointed out how word juncture is more easily perceived with some stress patterns than with others. She found that many segmental features (e.g., aspiration, glottal stops, etc.) that are strong cues for word juncture occurred in some phrases but not in others, depending on their stress pattern. That is, stress pattern cued word juncture perception indirectly by giving rise to segmental features at the word junctures. In our experiment, the stress pattern had to be a direct cue for word perception because, as the parsing of the hybrid phrases showed, the reiterant phrases had no segmental cue that enabled listeners to parse the phrases. So it seems that stress pattern can be either a direct or indirect cue for word perception. Stress pattern is a direct cue when, for listeners who know the rules, it constrains the parsing to be unique. Stress pattern is an indirect cue when it gives rise to segmental cues that are strong markers of word junctures.

2. Rhythm

The rhythm of speech is considered a prosodic cue for word perception because it was the only prosodic feature that influenced the parsing of hybrid phrases. It seems that monosyllabic word elongation is the primary rhythm cue for word perception, and that word-initial consonant elongation is a secondary rhythm cue.

Rhythm may also be the primary cue for stress perception. In reiterant phrases, there were noticeable differences in the length of syllables with 0, 2, and 1 stress levels. If the stress pattern of speech is perceived by its rhythm, then only rhythm remains as the cue for word perception.

The clarity with which the systematic effects of syntax, stress pattern, and parsing showed up in the consonant and syllable durations of reiterant speech was quite striking. The measurements were reliable (i.e., had small standard error intervals) despite a small sample of only seven tokens, and despite the fact that the tokens were from different talkers and not repeated tokens from one talker. When one considers the amount of work needed to get equally reliable measurements showing the same effects in normal speech from different talkers, it is easy to see why reiterant speech is efficient for studying speech rhythm in particular, and speech prosody in general.

3. Caveat

As is usually the case with laboratory studies, we face the issue of generality. Our conclusions about the prosodic cues for word perception were obtained with reiterant mimicries of trisyllabic adjective-noun phrases in a restricted syntactic context. Moreover, the listeners were told that each phrase consisted of only two content words: namely, an adjective followed by a noun. This is obviously more than listeners normally know a priori about an utterance. At the very least, this knowledge enabled the listeners to reject the hypothesis that an unstressed initial syllable in a phrase was a function word. We have addressed the issue of generality directly, and have underway experiments with longer phrases, different syntactic contexts, and different talkers and listeners.


Our experiments used an admittedly exotic form of speech in a rather circumscribed listening task. Yet, we believe that the experiments have demonstrated the following facts about speech production and perception, and that these facts transcend the restrictions of the experiment. First, it is clear that information about the division of a phrase into words is encoded prosodically in speech. Second, listeners can decode these prosodic cues to divide an utterance into its constituent words. These facts must be true to explain the regularities observed in the prosody of reiterant speech, and to explain how listeners were able to parse the phrases into words with no prior training.

B. Techniques for studying prosody

The experimental techniques we used proved effective for studying speech prosody. The crucial element in these experiments was the reiterant speech which served admirably as a carrier of the prosodic features we wanted to study. The hybrid speech synthesis technique enabled us to study complex speech features independent of any model or theories.

1. Reiterant speech

The nature of reiterant speech and its concomitant advantages for the study of prosody have already been discussed. We discuss here evidence in support of our expectation that reiterant speech is in fact speech with prosodic features well preserved, and with semantic and phonotactic features eliminated. We may take for granted that semantic features are eliminated, but we must show that our expectations for prosodic and phonotactic features are met. Some of the properties of reiterant speech are revealed by four aspects of our results: stress, parsing, rhythm, and spectrum. These results are discussed in turn in the following paragraphs.

The stress levels and patterns were well preserved in the reiterant speech. We were able to find good exemplars of almost all the stress patterns expected for the trisyllabic English phrases that were mimicked. Three levels of stress were transcribed reliably: This result is consistent with the generally accepted notion that English syllables have primary, secondary, and null stress.

Reiterant phrases were parsed on the basis of their prosodic features. In all likelihood, these features were present in reiterant speech because they occur in normal speech. If this were not so, then it would be hard to imagine how different talkers could have been so consistent in putting prosodic features into reiterant speech, and how listeners could, without training, use the features in a difficult speech perception task. So the fact that reiterant phrases were parsed at all is strong evidence that the prosody of normal speech was preserved in reiterant speech.

The measurements of syllable durations suggest that reiterant speech has the rhythm of normal speech. As already discussed, many of the reliable segment duration effects observed in normal speech (see Lehiste, 1970; Klatt, 1976) were also observed in reiterant speech. These effects include prepausal elongation, monosyllabic word elongation, stressed syllable elongation, word-initial consonant elongation, and interactions of these effects with stress pattern. That these effects are so readily observed with such a small sample is encouraging when one considers the large sample that would be necessary to observe the same effects in normal speech. But what we want to emphasize here is not so much the reliability of the measurements as their consistency. All the duration effects observed in reiterant speech are consistent with those previously observed in normal speech, and nothing that is inconsistent was observed.

All the evidence considered so far confirms that prosody is preserved in reiterant speech. Evidence confirming that phonotactic features are eliminated was the failure of spectral features to influence the parsing of the hybrid phrases. This failure means that all the /ma/ syllables sound alike regardless of their positions in a word. That is, the reiterant speech had no special segmental cues that marked the word junctures. We don't have evidence to show that the vowels in unstressed syllables were not reduced. But since vowel reduction is a cue for null stress, and since stress is a prosodic feature, the presence of vowel reduction in reiterant speech is tolerable.

The hypothesis that reiterant speech preserves the prosody of normal speech was studied by Liberman and Streeter (1978). They found that the rhythmic pattern of reiterant speech was highly reproducible across repetitions and talkers, and that it changed systematically and in the expected manner with changes in syntax and stress pattern. They also found that the duration of a "ma" syllable was about the same (assuming constant stress and syntax) whether the English syllable mimicked was short (e.g., "cut") or long (e.g., "freeze"). This means that reiterant speech is not unduly influenced by the phonemes of the actual English words underlying the mimicry.

In summary, of the evidence examined so far, all support and none contradicts the claim that reiterant speech has the prosody of normal speech without its meaning and sound variations. However, we have not yet done the crucial test to see if any novel findings obtained with reiterant speech will generalize to normal speech. Until such tests are done, any advantage claimed for reiterant speech is provisional.

2. Hybrid speech synthesis

The technique and advantages of hybrid speech synthesis have already been discussed. To get another perspective on the advantages of hybrid speech synthesis, consider an alternative procedure that assigns values to the prosodic features by some model or ad hoc process. Suppose that synthetic speech utterances are generated by a complete factorial design with rhythm and pitch as factors. And suppose that the parsing task is influenced more by pitch than by rhythm. What can we conclude? We can't take the results at face value because we can't be sure that the values assigned to the features are realistic. Suppose, for example, that the range of values assigned to pitch and rhythm are, respectively, too large and too small relative to the range of values observed for these features in natural speech. If so, then the results may be artifacts of the experimental procedure and not facts about speech. The assignment of realistic values that span a realistic range is a severe problem when complex speech features are studied. The hybrid speech synthesis technique solves this problem by assigning to features only those values which are found in natural speech. We can be reasonably sure that the results of an experiment using hybrid speech synthesis are not confounded by artifacts due to some defect in the procedure for assigning values to features.

VII. SUMMARY

Listeners were able to parse reiterant speech that mimicked a trisyllabic adjective-noun phrase into two parts corresponding to the words mimicked. Most, but not all, listeners apparently used the stress pattern to parse the phrases because a phrase was hard or easy to parse depending on whether its stress pattern was consistent, respectively, with two parsings or only one parsing of the phrase. The hybrid speech synthesis experiment showed that the parsing of a phrase was affected when its rhythmic pattern was changed, but not when its pitch and amplitude contours were changed. Duration measurements of reiterant speech syllables revealed, reliably with a small sample, many aspects of speech rhythm that were correlated with the stress pattern and parsing of the phrases. We conclude that the stress pattern and rhythm of speech are the primary prosodic cues for word perception. These results suggest that the proper specification of stress pattern and rhythm in rule synthesis would improve not only the naturalness, but also the intelligibility, of synthetic speech.

ACKNOWLEDGMENTS

We would like to thank Peter Denes for stimulating our work on word perception in particular, and on speech prosody in general. We are indebted to Mark Liberman for bringing reiterant speech to our attention, and for numerous discussions of this work. To both, a grateful "mama!"

1 Stress pattern refers to the pattern formed by variations in the stress level (either null, secondary, or primary) of the syllables of a sentence. Rhythm refers to variations in the relative durations of the syllables of a sentence. Both the stress pattern and rhythm depend on the syntax of the sentence and its constituent words. Stress pattern, rhythm, pitch, and amplitude collectively define the prosodic features of English (Lehiste, 1970).

2 Adjective and noun, when used for reiterant speech, refer to the /ma/ syllables in the reiterant phrases that correspond to the mimicked adjective and noun.

3 We can't explain why the 1 01 and 01 1 phrases gave anomalous results for the two groups.

4 We omitted the seventh talker, whose ambiguous phrases were parsed at barely above the chance level.

5 These average estimates were based on comparisons between two syllables in the same position in a word but with different stress levels. The 11 1 and 1 11 phrases were excluded from the comparisons because the stress level of their second syllable was nebulous.

Allen, G. D. (1972). "The location of rhythmic stress beats in English: an experimental study I," Lang. Speech 15, 72-100.

Atal, B. S., and Hanauer, S. L. (1971). "Speech analysis and synthesis by linear prediction of the speech wave," J. Acoust. Soc. Am. 50, 637-655.

Carlson, R., Granstrom, B., Lindblom, B., and Rapp, K. (1973). "Some timing and fundamental frequency characteristics of Swedish sentences: data, rules, and perceptual evaluation," Speech Transmission Lab. Q. Prog. Status Rep. 4, 11-19.

Garding, E. (1967). Internal Juncture in Swedish (C. W. K. Gleerup, Lund).

Klatt, D. H. (1973). "Interaction between two factors that influence vowel duration," J. Acoust. Soc. Am. 54, 1102-1104.

Klatt, D. H. (1974). "The duration of [s] in English words," J. Speech Hear. Res. 17, 54-63.

Klatt, D. H. (1975). "Vowel lengthening is syntactically determined in connected discourse," J. Phon. 3, 129-140.

Klatt, D. H. (1976). "Linguistic uses of segmental duration in English: acoustic and perceptual evidence," J. Acoust. Soc. Am. 59, 1208-1221.

Lehiste, I. (1960). "An acoustic-phonetic study of internal open juncture," Phonetica Suppl. 5.

Lehiste, I. (1964). "Juncture," Proc. Fifth Int. Congr. Phon. Sci. 172-200.

Lehiste, I. (1970). Suprasegmentals (MIT, Cambridge, MA).

Lehiste, I. (1972). "The timing of utterances and linguistic boundaries," J. Acoust. Soc. Am. 51, 2018-2024.

Lehiste, I. (1973). "Rhythmic units and syntactic units in production and perception," J. Acoust. Soc. Am. 54, 1228-1234.

Liberman, M. Y., and Streeter, L. A. (1978). "Use of nonsense-syllable mimicry in the study of prosodic phenomena," J. Acoust. Soc. Am. 63, 231-233.

Martin, J. G. (1970). "On judging pauses in spontaneous speech," J. Verb. Learn. Verb. Behav. 9, 75-78.

Nakatani, L. H. (1977). "Computer-aided signal handling for speech research," J. Acoust. Soc. Am. 61, 1056-1062.

Nooteboom, S. G. (1973). "The perceptual reality of some prosodic duration," J. Phon. 1, 25-45.

O'Connor, J. D., and Tooley, O. M. (1964). "The perceptibility of certain word boundaries," in In Honour of Daniel Jones, edited by D. Abercrombie, D. B. Fry, P. A. D. MacCarthy, N. C. Scott, and J. L. M. Trim (Longmans, Green, London).

Olive, J. P., and Nakatani, L. H. (1974). "Rule synthesis of speech by word concatenation: a first step," J. Acoust. Soc. Am. 55, 660-666.

Oller, D. K. (1973). "The effect of position in utterance on speech-segment duration in English," J. Acoust. Soc. Am. 54, 1235-1247.

Streeter, L. A. (1976). "Perceptual determinants of phrase boundary placement," J. Acoust. Soc. Am. 60, S28(A).

Umeda, N. (1975). "Vowel duration in American English," J. Acoust. Soc. Am. 58, 434-445.

Umeda, N. (1977). "Consonant duration in American English," J. Acoust. Soc. Am. 61, 846-858.
