INT J LANG COMMUN DISORD, MARCH–APRIL 2014, VOL. 49, NO. 2, 228–239

Research Report

Assessment of intelligibility using children's spontaneous speech: methodological aspects

Tove B. Lagerberg, Jakob Åsberg, Lena Hartelius and Christina Persson

The Sahlgrenska Academy at the University of Gothenburg, Institute of Neuroscience and Physiology, Division of Speech and Language Pathology, Gothenburg, Sweden

(Received January 2013; accepted September 2013)

Abstract

Background: Intelligibility is a speaker's ability to convey a message to a listener. Including an assessment of intelligibility is essential in both research and clinical work relating to individuals with communication disorders due to speech impairment. Assessment of the intelligibility of spontaneous speech can be used as an overall indicator of the severity of a speech disorder. There is a lack of methods for measuring intelligibility on the basis of spontaneous speech.

Aims: To investigate the validity and reliability of a method where listeners transcribe understandable words and an intelligibility score is calculated on the basis of the percentage of syllables perceived as understood.

Methods & Procedures: Spontaneous speech from ten children with speech-sound disorders (mean age = 6.0 years) and ten children with typical speech and language development (mean age = 5.9 years) was recorded and presented to 20 listeners. Results were compared between the two groups and correlation with percentage of consonants correct (PCC) was examined.

Outcomes & Results: The intelligibility scores obtained correlated with PCC in single words and differed significantly between the two groups, indicating high validity. Inter-judge reliability, analysed using intra-class correlation (ICC), was excellent in terms of the average measure for several listeners.

Conclusions & Implications: The results suggest that this method can be recommended for assessing intelligibility, especially if the mean across several listeners is used. It could also be used in clinical settings when evaluating intelligibility over time, provided that the same listener makes the assessments.

Keywords: intelligibility, PCC, continuous speech, spontaneous speech, speech disorder, children.

What this paper adds?

What is already known on the subject?
Intelligibility can be defined as the ability of the speaker to convey a message to the listener using the acoustic signal. This can be assessed by rating scales or by orthographic transcription of single words or spontaneous speech. When spontaneous speech is used, the gold standard is to compare listeners' transcriptions with a master transcript made by clinicians and/or caregivers. However, this is a time-consuming procedure and there is a need for studies investigating alternative methods.

What this paper adds?
This is the first study to investigate the validity and reliability of a method for assessing intelligibility on the basis of orthographic transcription of spontaneous speech without a master transcript. Instead, the method investigated is based on the proportion of words perceived as understood by the listener and uses the number of syllables as the basis for calculating the intelligibility score. The results show satisfactory reliability and validity for both research and clinical purposes.

Address correspondence to: Tove B. Lagerberg, The Sahlgrenska Academy at the University of Gothenburg, Institute of Neuroscience and Physiology, Division of Speech and Language Pathology, Box 452, SE-405 30 Gothenburg, Sweden; e-mail: [email protected]

International Journal of Language & Communication Disorders © 2013 Royal College of Speech and Language Therapists
ISSN 1368-2822 print/ISSN 1460-6984 online DOI: 10.1111/1460-6984.12067

Introduction

One definition of intelligibility is a speaker's ability to convey a message to a listener using the acoustic signal (Yorkston et al. 1996). Since conveying a message is among the purposes of most communication, it is appropriate to include an assessment of intelligibility thus defined in both research and clinical work relating to individuals with communication disorders due to speech impairment. In addition, an assessment of the intelligibility of spontaneous speech can be used as an overall indicator of the severity of a speech disorder, since it allows the clinician to evaluate 'the sum of interacting processes that are involved in speech production' (Yorkston and Beukelman 1981, p. 1).

The definition of intelligibility and related concepts is the subject of an ongoing discussion. One rather narrow definition relates only to the acoustic speech signal (Kent et al. 1989), i.e. it refers to the degree to which the speaker's intended message is transmitted to the listener by means of the acoustic speech signal without any contextual cues, such as linguistic cues or visual cues from non-verbal communication (Yorkston et al. 1996). Focusing on intelligibility using the above-mentioned narrow definition (Kent et al. 1994) is useful in the assessment of the therapeutic efficacy of interventions aiming to improve the speech signal (Yorkston et al. 1996). For this reason, that definition was chosen for the present study. However, the assessment of intelligibility—regardless of how it is defined—involves not only the speaker's ability but also the type of speech material (e.g. reading, spontaneous speech or single words), the medium of transmission (e.g. live, audio or video recording), the listener's background (e.g. naive listener or speech–language pathologist (SLP)) and the listener's task (e.g. scale rating, transcription or multiple-choice). This leads to considerable variation in how intelligibility is and can be assessed; the choice of method in any given case has to be made with reference to the aim and purpose of the assessment as well as to the speaker's ability (Morris and Wilcox 1999, Kent et al. 1994).

The most frequently used method is for listeners to rate the level of intelligibility on an equal-appearing interval scale while listening to recordings of spontaneous speech (Whitehill 2002). However, this method has certain drawbacks. For example, it has been argued that intelligibility is a 'prothetic continuum' (i.e. quantitative and additive, like loudness) and that the listener is therefore unable to partition it into equal-appearing intervals (Schiavetti 1992). Schiavetti recommends the alternative method of transcribing words or syllables, which yields a percentage, i.e. a value on a ratio scale, and also enables a robust statistical analysis.

As regards the type of material, the advantage of a structured speech material such as lists of single words is that it can be designed to include phonemes that are particularly vulnerable to a specific type of speech error, and it can also make it possible to analyse the contribution of different types of error to the reduction of intelligibility (Klein and Flint 2006). At the same time, an assessment based on spontaneous speech has two major advantages: firstly, it includes supra-segmental features, which have been shown to affect intelligibility; and secondly, it is thought to be more representative of the speaker's ability to use speech to communicate in daily life, meaning that it has greater ecological validity than an assessment based on lists of words or sentences read out aloud (Kwiatkowski and Shriberg 1992). The main drawback of using spontaneous speech resides in the difficulty of knowing exactly what the speaker is saying, especially if she or he has a severe speech disorder. This makes it problematic to calculate the number of words correctly understood. In fact, it has been shown that even parents are not always able to know exactly what their child intends to say (Kwiatkowski and Shriberg 1992). One way of addressing this difficulty is to use an orthographic transcription of the spontaneous-speech sample to make a 'master transcript', which is then held to be correct (Hodge and Gotzke 2007, Gordon-Brannan and Hodson 2000). It has been suggested that the use of master transcripts, employing narrow phonetic transcription based on information from caregivers and clinicians, is a valid and reliable method (Kwiatkowski and Shriberg 1992). However, the production of such transcripts is a time-consuming and elaborate task. What is more, one cannot be certain that they represent a correct transcription of what the speaker intended to say.

The Weiss Intelligibility Test (Weiss 1982) includes both a single-word task and a continuous-speech task and uses a method without a master transcript. The method investigated in the present study is similar to the spontaneous-speech task in that test, where the listener is asked to transcribe 200 words and either to mark words that are not understood with a hyphen or to indicate words that are understood in a 200-box grid. The output of the test is the percentage of words understood. The validity of this method is described briefly in the test manual (Weiss 1982) but has not yet been studied thoroughly. Weiss analysed the test's reliability using speech material from 60 speakers aged 3–64 years with various types of disorder, such as dysarthria, verbal dyspraxia and articulation and language delays. Five trained listeners and 25 untrained listeners made orthographic transcriptions from audio recordings. Inter-judge reliability was between 0.72 and 0.94, with higher values for speakers with a high intelligibility score and higher values for single words than for spontaneous speech. There was no significant difference between trained and untrained listeners. Intra-judge reliability was low, between 0.43 and 0.64.


This was attributed to the judges' familiarizing themselves with the subjects' speech: scores were on average 12% higher for the second transcription (Weiss 1982).

An additional problem when using spontaneous speech is how to determine the number of words in a speech sample when the speaker's intelligibility is reduced (Flipsen 2006). Flipsen described four alternative approaches (Intelligibility Indexes, or IIs) to solving this problem: II-Original, where word boundaries are established during transcription; II-1.25, where a syllables-per-word index is used; II-PS, which is based on the assumption that the unintelligible parts have the same number of syllables per word as the intelligible parts for each speaker; and II-AN, where age-normalized values are used. An alternative approach, which is taken in the present study, is to count the number of syllables for both the intelligible and the unintelligible sequences in the calculation of intelligibility.

There is clearly a need to explore and analyse methods for assessing intelligibility on the basis of spontaneous speech. To be a viable alternative (or complement) to rating scales in research contexts and in the clinical evaluation of interventions, such a method needs to be both valid and reliable yet not overly complicated and time-consuming. The aim of the present study was to investigate the reliability and validity of a method for assessing intelligibility. The method is based on the orthographic transcription of words perceived as understood and the identification of unintelligible syllables in spontaneous speech. It uses the percentage of syllables perceived as understood as the basis for calculating the intelligibility score. The following research questions were asked:

- Does the method yield results that are reliable in terms of inter- and intra-judge reliability?
- Does the method yield the same result for the same child (and listener) when applied to different samples of that child's spontaneous speech?
- Does the method yield results that correlate with another variable supposedly related to intelligibility, namely the percentage of consonants correct (PCC) (Shriberg and Kwiatkowski 1982)?
- Does the method yield results that differ statistically significantly between two groups that could be assumed to differ in their intelligibility, namely children with speech-sound disorders and children with typical speech and language development?
- Does the method accurately reclassify individual children as members of either group (children with speech-sound disorders versus children with typical speech and language development)?

Method

Participants

The speakers in this study were ten children with a speech-sound disorder (SSD group) (range = 4;6–8;3 years; mean = 6.0 years; SD = 1.0 years) and ten children with typical speech and language development (TD group) (range = 4;8–7;4 years; mean = 5.9 years; SD = 1.1 years). The SSD children were recruited through SLPs working at the Department of Paediatric Speech and Language Pathology at Sahlgrenska University Hospital in Gothenburg, Sweden. These children had been diagnosed as having a speech and language disorder that affected intelligibility according to the treating SLP. The children with typical speech and language development were recruited through contacts with schools and preschools in the same area; the inclusion criterion for these children was no past or present contact with an SLP. All participants had normal hearing and Swedish as their strongest language, as reported by their parents (table 1).

The listeners were 18 students enrolled in an SLP study programme and two recent graduates. They were all female and between 20 and 35 years old. All had normal hearing and Swedish as their strongest language, according to self-reports. The assessment of consonants for the calculation of PCC was performed by two other SLP students (who also made the recordings; see below).

Speech samples

Two different types of recorded speech material were used: spontaneous speech to assess intelligibility, and single words to calculate PCC. The spontaneous speech was elicited through conversation about anything that a given child wished to talk about, such as toys that were present in the room or things that the child had done at school or on holiday. Single-word material was elicited using the Swedish Articulation and Nasality Test (SVANTE) (Lohmander et al. 2005). The part of that test used in the present study consists of pictures representing 76 words that, taken together, include all Swedish phonemes. These 76 words include a total of 157 consonants.

The recordings were made by two SLP students using a TASCAM HD-P2 portable high-definition stereo audio recorder with an external Sony ECM-MS957 microphone placed on a table about 30 cm from the child. For the TD children, the location was a quiet room at their pre-school or school. The SSD children were recorded either in a quiet room at their school or pre-school, at the SLP clinic or at home, according to each family's wishes.


Table 1. Age and sex of speakers

SSD group                                      TD group
Participant   Age (years;months)   Sex        Participant   Age (years;months)   Sex
SSD-1         4;6                  M          TD-1          4;8                  M
SSD-2         5;6                  F          TD-2          4;11                 F
SSD-3         6;7                  M          TD-3          7;3                  M
SSD-4         5;11                 F          TD-4          5;0                  F
SSD-5         5;11                 M          TD-5          5;3                  M
SSD-6         8;3                  M          TD-6          7;4                  M
SSD-7         6;6                  F          TD-7          7;3                  F
SSD-8         6;6                  F          TD-8          6;6                  F
SSD-9         5;2                  M          TD-9          5;10                 M
SSD-10        5;4                  F          TD-10         4;10                 F

Note: F, female; M, male; SSD, children with speech-sound disorder; and TD, children with typical speech and language development.

Table 2. PCC for each participant, intelligibility scores of individual listeners for each participant, and mean intelligibility score for all four listeners in each listener group

Listener group A
Participant   PCC    A1    A2    A3    A4    Mean (A1–A4)
SSD-4         61     82    78    78    73    77.8
SSD-6         81     95    95    94    93    94.2
TD-3          99     97    100   98    96    97.8
TD-4          72     97    99    97    93    96.5

Listener group B
Participant   PCC    B1    B2    B3    B4    Mean (B1–B4)
SSD-8         49     76    30    42    82    57.5
SSD-9         61     75    52    80    73    70.0
TD-2          98     99    100   100   100   99.8
TD-6          100    100   100   99    100   99.8

Listener group C
Participant   PCC    C1    C2    C3    C4    Mean (C1–C4)
SSD-5         70     90    76    92    88    86.5
SSD-7         65     93    82    82    84    85.2
TD-1          79     96    98    96    100   95.5
TD-7          100    100   99    100   100   99.8

Listener group D
Participant   PCC    D1    D2    D3    D4    Mean (D1–D4)
SSD-2         60     74    74    62    30    60.0
SSD-3         54     59    62    55    29    51.2
TD-5          97     98    100   94    94    96.5
TD-8          100    100   100   99    99    99.5

Listener group E
Participant   PCC    E1    E2    E3    E4    Mean (E1–E4)
SSD-1         59     83    96    75    93    87.8
SSD-10        61     82    77    64    89    78.0
TD-9          96     92    96    89    97    93.5
TD-10         100    97    100   97    99    98.2

The speech material used for the assessment of intelligibility consisted of utterances from the child's spontaneous speech (the speech of the SLP student was removed), each edited to a length of 1–15 words using Audacity 1.3 Beta (Unicode). The editing was carried out manually and done as far as possible in accordance with semantically natural pauses so as not to interfere with content. Utterances consisting of only 'yes' or 'no' were not included. For each child, a speech sample consisting of utterances totalling approximately 100 words was prepared (including the entire utterance that contained the 100th word). In two cases, the original

recording contained fewer than 100 words (SSD-3 = 50, TD-1 = 52). Even so, these cases were included, since the difficulty in obtaining a substantial amount of spontaneous speech from some children may actually, in itself, have been a sign of their speech difficulties (due, for example, to a reluctance to speak). Eight children in the SSD group produced enough speech for two samples to be obtained. These were included to investigate the effect of different speech samples on assessed intelligibility for the same child and the same listener (table 6).


Table 3. Intra-class correlation (ICC) for the five listener groups

Listener group       A      B      C      D      E
Single measure       0.94   0.68   0.67   0.78   0.56
Average measure^a    0.98   0.89   0.89   0.93   0.84

Note: ^a The average measure is indicative of the reliability of the scale when scores represent the average of different listeners' judgements (see note 1).

Table 4. Intra-class correlation (ICC) for all subjects, for the SSD group and for the TD group

                     SSD group (n = 10)   TD group (n = 10)   Total (n = 20)
Single measure       0.48                 0.48                0.71
Average measure^a    0.79                 0.78                0.91

Note: ^a The average measure is indicative of the reliability of the scale when scores represent the average of different listeners' judgements (see note 1).

Table 5. Intra-judge reliability: difference in scores (percentage points) between the first and second transcription of duplicate samples for the four listeners for each child

              First versus second transcription
Participant   Listener 1   Listener 2   Listener 3   Listener 4
SSD-1         −3           −6           −7           5
SSD-2         10           19           11           11
SSD-3         12           8            9            1
SSD-4         6            5            10           10
SSD-7         3            −2           7            4
SSD-8         9            5            −5           9

The average length of utterances for all children in the study was 6.5 words (SD = 1.9; minimum = 1.7; maximum = 9.5). An unpaired-samples t-test did not reveal any statistically significant difference between the two groups in the average length of utterances (t(18) = 0.763; p = 0.45). The sound files were transferred to software created specifically for the study to be used for the transcription task and the calculation of the proportion of syllables perceived as understood.

Assessment procedure: PCC

PCC was assessed by the two SLP students who had made the recordings. The speech samples were presented in random order, making the listeners blind to whether or not a child had SSD. Vowels were not included in the assessment. Consensus judgement was used for the assessment of PCC (Shriberg et al. 1997a). The listeners transcribed (orthographically) the correct consonants only (not the incorrect ones) for each child and discussed their transcriptions to reach a consensus.

Table 6. Difference in scores (percentage points) between the two different speech samples from each child, for each of the four listener transcriptions, and between the mean of the four listener transcriptions for each child

              Speech sample 1 versus 2: listener transcription
Participant   1      2      3      4      Mean
SSD-1         −5     −15    −10    1      −7
SSD-2         −14    −8     −18    −5     −11
SSD-5         1      18     −13    8      4
SSD-6         2      −1     6      −5     1
SSD-7         −3     12     13     2      6
SSD-8         6      29     27     6      17
SSD-9         −12    −36    −40    −7     −24
SSD-10        3      8      3      11     6

The decisions on whether a given consonant was correct or incorrect were made in accordance with the scoring rules of Shriberg and Kwiatkowski (1982), according to which a consonant should be scored as incorrect if it includes one of six types of consonant sound change (deletion of target consonant, substitution of another sound for a target consonant, partial voicing of initial target consonant, distortion of a target consonant, addition of a sound to a correct or incorrect target consonant, initial /h/ deletion in stressed syllables). Alternative pronunciations due to dialect or colloquial language were counted as correct where appropriate. PCC for each child was calculated as follows:

PCC = [total number of correctly pronounced consonants] / [total number of consonants (157)] × 100
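For concreteness, the calculation can be sketched in a few lines of Python. This is an illustrative helper under the study's stated constants (157 target consonants in the SVANTE material), not the authors' own software:

```python
def percentage_consonants_correct(n_correct: int, n_total: int = 157) -> float:
    """PCC: percentage of target consonants judged correct by consensus.

    The default total of 157 is the number of consonants in the 76-word
    SVANTE single-word material; vowels are not scored.
    """
    if not 0 <= n_correct <= n_total:
        raise ValueError("n_correct must lie between 0 and n_total")
    return 100.0 * n_correct / n_total

# Example: 96 of the 157 target consonants judged correct gives PCC = 61.1.
print(round(percentage_consonants_correct(96), 1))
```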

Assessment procedure: intelligibility

Since it would have been too time-consuming for each listener to transcribe all speech samples, the speech material was split into five sets, each including four speakers, and the 20 listeners were randomly divided into five groups (A–E). Each group was then assigned a set of speech material consisting of recordings of four of the children (two from the TD group and two from the SSD group) (table 2). Transcribing the utterances from one child took approximately 15 min. An attempt was made to give each listener group approximately the same listener burden by distributing the participants across listener groups in such a way that the average PCC for each set was approximately the same. To enable the assessment of intra-judge reliability, 30% of the speech samples were reassessed by the listeners (table 5). Both samples from each of the eight SSD children who had produced enough material for a second sample were assessed by the listener group assigned the respective child (table 6). The samples always appeared in the same order.

The listeners listened to each utterance twice before moving on to the next utterance, and they went through all the utterances from one child before proceeding to the next (where two samples from the same child were included, each sample was counted as a separate child in this context). The sound files were played to the listeners using the software developed specially for the study, and the listeners made their transcriptions using that software. The listeners used headphones (Sennheiser HD 212 Pro) and could adjust the volume as they wished. They were instructed to transcribe orthographically all words they understood and to mark each syllable which they did not understand with '0'. Transcriptions were made into a TextGrid with a specific space for each utterance. Guessing was discouraged. Further, the listeners were told to transcribe any onomatopoetic words and any revisions of phrases and words, but to leave out any repetitions of syllables or phonemes and any filler words such as 'eh'. Before performing the work to be used in the study, each listener first practised by transcribing five examples which she was able to discuss with the test leader, who remained present to answer questions throughout the listening session. The listeners made their transcriptions and listened at their own pace, selecting a dialogue box when they were ready to listen to an utterance for the second time or to a new utterance.

Finally, the software calculated the number of syllables transcribed (by counting the vowel characters written) and the total number of syllables (by counting vowels and zeros). In the case of two vowel characters in a digraph, this usually corresponds phonetically to a diphthong, meaning that both vowels are part of the same syllable. This issue is partly handled by the software, given that there has to be an orthographic space between two vowels in order for them to be counted as different syllables. The intelligibility score for each participant was calculated as follows:

Intelligibility = [total number of syllables in transcribed words] / [total number of syllables in the sample] × 100
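The counting rules above translate directly into code. The following is a minimal sketch, assuming the Swedish vowel letters as the vowel set and one space-separated '0' per unintelligible syllable; it is a reimplementation for illustration, not the software built for the study:

```python
SWEDISH_VOWELS = set("aeiouyåäöAEIOUYÅÄÖ")

def intelligibility_score(transcription: str) -> float:
    """Percentage of syllables perceived as understood.

    Each transcribed word contributes one syllable per run of adjacent
    vowel characters (so a digraph counts as a single syllable unless an
    orthographic space separates the vowels), and each '0' token marks
    one syllable that the listener did not understand.
    """
    understood = 0
    unintelligible = 0
    for token in transcription.split():
        if token == "0":
            unintelligible += 1
            continue
        prev_was_vowel = False
        for ch in token:
            if ch in SWEDISH_VOWELS:
                if not prev_was_vowel:  # a new vowel run starts a new syllable
                    understood += 1
                prev_was_vowel = True
            else:
                prev_was_vowel = False
    total = understood + unintelligible
    return 100.0 * understood / total if total else 0.0

# Four understood syllables and two unintelligible ones -> 66.7%.
print(round(intelligibility_score("jag vill ha den 0 0"), 1))
```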

Analytical strategies and statistics

The construct validity of a test can be studied by examining whether the test results correlate with some other variable that is hypothesized to be related to the variable that the test is supposed to measure (Streiner and Norman 2008). Since PCC was created to assess the severity of involvement, encompassing disability, intelligibility and handicap (Shriberg and Kwiatkowski 1982), it was hypothesized to be related to intelligibility as assessed in the present study. PCC was originally designed to be used for spontaneous speech (Shriberg and Kwiatkowski 1982), but it has been applied to single-word speech material as well (Zajac et al. 2010). Shriberg has also suggested that single-word material may be used, provided that the results are not to be related to the severity classifications in the validation study (Shriberg et al. 1997b, Shriberg and Kwiatkowski 1982). Further, it has been shown that there is a strong relationship between intelligibility in single words and in continuous speech, at least when predetermined sentence samples are used (Weismer et al. 2001, Yorkston and Beukelman 1978). Hence, in the present study it was hypothesized that intelligibility scores for spontaneous speech would correlate with PCC scores for single words. Secondly, since intelligibility is related to deviant speech (Kent 1992), it was also hypothesized that intelligibility scores would differ between groups with and without deviant speech. A valid intelligibility assessment method should therefore be able, to some extent, to detect differences between individuals with and without speech deviances (Hodge and Gotzke 2007, Lillvik et al. 1999, Zajac et al. 2010).

Reliability was investigated in terms of inter-judge and intra-judge reliability as well as consistency of assessment for two different speech samples from the same child. To analyse the inter-judge reliability of intelligibility scores, intra-class correlation (ICC) was used in two different ways. First, ICC was calculated for the speech material used in each of the five listener groups (A–E), so that five different values for ICC were obtained, one for each listener group (table 3). Second, ICC was calculated for the speech material from all 20 children (table 4). To make this possible, the results for the listeners with the same number, i.e. A1–E1, A2–E2, A3–E3 and A4–E4, were grouped together and taken to represent one judge (table 2). Since there were four listeners in each listener group, this yielded a total of four (represented) judges to be included in the calculation of ICC. This operation was not considered to introduce an unacceptable amount of noise into the analyses: given that full randomization had been carried out, there was no reason to expect any systematic differences between the listeners. Nonetheless, it is important to note that if this procedure did affect the results, its effect would be to lower the reliability of the assessment method.

To analyse the consistency of assessment for two different speech samples from the same child, each 'listener–child combination' was considered as a single item. This yielded 32 items (eight children with four listeners each) in each variable (first and second speech materials for that particular child) (table 6). The distributions were found to be approximately normal according to histogram inspection and calculation of skewness and kurtosis, with only one exception: PCC in the TD group was excessively high in kurtosis (> 1.96). Therefore, the correlation between intelligibility and PCC in the TD group was examined with Spearman's non-parametric correlation. Parametric Pearson correlation tests were used to determine whether an association existed between PCC and intelligibility score in the SSD group, to calculate intra-judge reliability for the intelligibility assessment and to analyse the association between the two different speech samples.
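The single and average measures reported in tables 3 and 4 correspond to ICC(2,1) and ICC(2,k) in the terminology of Shrout and Fleiss (1979). As a sketch of how they can be computed from a complete children-by-listeners score matrix (an illustrative NumPy reimplementation, not the statistics software actually used in the study):

```python
import numpy as np

def icc_two_way_random(scores: np.ndarray) -> tuple[float, float]:
    """Return (single measure ICC(2,1), average measure ICC(2,k)).

    `scores` is an n_children x k_listeners matrix with no missing cells
    (two-way random-effects model, Shrout and Fleiss 1979, case 2).
    """
    n, k = scores.shape
    grand = scores.mean()
    child_means = scores.mean(axis=1)
    listener_means = scores.mean(axis=0)
    bms = k * np.sum((child_means - grand) ** 2) / (n - 1)     # between children
    jms = n * np.sum((listener_means - grand) ** 2) / (k - 1)  # between listeners
    resid = scores - child_means[:, None] - listener_means[None, :] + grand
    ems = np.sum(resid ** 2) / ((n - 1) * (k - 1))             # residual
    single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    average = (bms - ems) / (bms + (jms - ems) / n)
    return single, average

# The four listeners of group A (rows = SSD-4, SSD-6, TD-3, TD-4; table 2):
group_a = np.array([[82, 78, 78, 73],
                    [95, 95, 94, 93],
                    [97, 100, 98, 96],
                    [97, 99, 97, 93]], dtype=float)
print(icc_two_way_random(group_a))  # approx. (0.94, 0.98), as in table 3
```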


To investigate whether there was any statistically significant difference in terms of intelligibility results between the SSD group and the TD group, an unpaired-samples t-test was applied; the results are reported with effect size expressed as r. Finally, a discriminant function analysis was conducted to examine whether the intelligibility score could be used to correctly identify participants as regards group membership (i.e. SSD versus TD). To reduce the risk of type 1 errors, p < 0.01 was interpreted as statistically significant.

Results

Descriptive data and background analyses

Table 2 presents, for each participant, the PCC score, the intelligibility score awarded by each listener and the mean intelligibility score for all four listeners. To analyse whether age influenced intelligibility scores, a correlation was sought between age and intelligibility. This correlation was found not to be statistically significant (r = 0.03).

Reliability

Inter-judge reliability was investigated using ICC. As the aim of this study was to investigate the usefulness of the intelligibility assessment method in both clinical and research contexts, both single and average measures are reported (see note 1). It has been suggested that ICC > 0.75 should represent excellent reliability, ICC = 0.4–0.75 fair to good reliability and ICC < 0.4 poor reliability (Fleiss 1986). According to this scale, two listener groups had excellent reliability in terms of the single measure, while three had fair to good. In contrast, all listener groups had excellent reliability in terms of the average measure (table 3). As regards the ICC for all listeners (n = 20), reliability in terms of the average measure was excellent for both the SSD group and the TD group separately, as well as for the total group (n = 20). In terms of the single measure, on the other hand, reliability was fair to good in all three cases (table 4).

Intra-judge reliability was investigated on the basis of the six duplicate samples presented to the four listeners. The Pearson correlation coefficient was strong and statistically significant (r = 0.94; p < 0.01). In 75% of the 24 cases, the score was higher for the second transcription. In 83% of the cases, the difference was less than 10 percentage points (table 5).

Different speech samples

The correlation between the two different speech samples from the same child was statistically significant (r = 0.73; p < 0.01).

In addition, a paired-samples t-test identified no statistically significant difference between the results for the two materials (t(31) = 0.65; p = 0.52). Of the 32 comparisons of differences between the two samples, 53% yielded a positive value, i.e. the second sample gave a higher intelligibility score. In 72% of these 32 cases, the difference was less than 10 percentage points.

Validity

For the entire sample (n = 20), the correlation between intelligibility score and PCC was statistically significant (r = 0.84; p < 0.01) (figure 1). Analysed separately, the correlation between intelligibility score and PCC was statistically significant in both the SSD group (r = 0.79; p < 0.01) and the TD group (rho = 0.77; p < 0.01). The mean intelligibility score for the SSD group was 74.7 (SD = 14.5; minimum = 51; maximum = 94), while that for the TD group was 97.7 (SD = 2.1; minimum = 94; maximum = 100) (figure 2). This difference in intelligibility score between the two groups is statistically significant (t(9.40, with equal variances not assumed) = 4.96; p < 0.001; r = 0.74).

A discriminant function analysis was conducted to establish whether the individual children in the two groups—with and without SSD—could be reclassified to the correct group based on the result of the intelligibility assessment. A statistically significant discriminant function was found (Wilks' λ = 0.422; χ²(1, n = 20) = 15.09; p < 0.0001). Specifically, all of the children in the TD group were classified correctly. Seven of the children with SSD were classified correctly, while three of them (SSD-1, SSD-5 and SSD-6) were classified with the TD children. Given that these misclassifications might be clinically and theoretically important, a closer examination of these three children's characteristics was carried out. It emerged that two of them (SSD-5 and SSD-6) had the two highest PCC scores in the SSD group. An analysis of the types of error that these three children had made in the PCC material showed that those of SSD-1 and SSD-5 related to omissions and phonological processes such as stopping and fronting, while the majority of SSD-6's articulation errors were misarticulations of /r/. It was further found that these three children had the three highest values for average utterance length (as edited in their speech samples) in the SSD group (7.4, 9.1 and 7.9 words/utterance for SSD-1, SSD-5 and SSD-6, respectively). To investigate whether the observed association between average utterance length and intelligibility score held for all participants (not just for these three), correlation analyses were performed. The correlation proved to be statistically significant in the SSD group (r = 0.78; p < 0.01), but not in the TD group (r = 0.65).
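As an illustration of how the reclassification can be approximated from published numbers, the sketch below feeds the mean intelligibility scores from table 2 into scikit-learn's linear discriminant analysis. This is a reconstruction under the assumption that the listener means were the predictor; the paper does not name its statistics package:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Mean intelligibility scores across the four listeners (table 2).
ssd = [87.8, 60.0, 51.2, 77.8, 86.5, 94.2, 85.2, 57.5, 70.0, 78.0]  # SSD-1..SSD-10
td = [95.5, 99.8, 97.8, 96.5, 96.5, 99.8, 99.8, 99.5, 93.5, 98.2]   # TD-1..TD-10

X = np.array(ssd + td).reshape(-1, 1)
y = np.array(["SSD"] * 10 + ["TD"] * 10)

pred = LinearDiscriminantAnalysis().fit(X, y).predict(X)
print([f"SSD-{i + 1}" for i in range(10) if pred[i] == "TD"])
# ['SSD-1', 'SSD-5', 'SSD-6'] -- the three children classified with the TD group
```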


Figure 1. Scatterplot of intelligibility scores and PCCs for all children.

Discussion

The present study has investigated the reliability and validity of a method for assessing speech intelligibility in children. The method is based on children's spontaneous speech and involves the orthographic transcription by a listener of words perceived as understood, along with a count of the syllables in the portions not understood and the calculation of the percentage of syllables perceived by the listener as understood. The results were fairly satisfactory in terms of both reliability and validity; they will be discussed in greater detail below.

The reliability of the method varied somewhat depending on how reliability was assessed and whether the full sample or individual listener groups were considered. Inter-judge reliability in terms of ICC average measures was found to be excellent both for the full sample and for all five listener groups. In other words, the method is most certainly reliable for use in research where the key variable consists of a mean across several judges (Shrout and Fleiss 1979). However, in terms of ICC single measures, which are more relevant when there is only one listener or judge, as is typically the case in clinical work, the method's reliability attained the 'excellent' level for two of the listener groups only; reliability values for the three other groups, as well as for the full sample, were 'fair to good'.

Two of the listeners (D4 and B2) produced very low intelligibility scores. The existence of inter-listener variability in performance on intelligibility assessments has been described before, for example in a study of listener strategies in relation to dysarthric speakers (Hustad et al. 2011). An impressionistic observation by the present authors in this respect is that some listeners (including D4 and B2) appear to have a tendency to be stricter in their transcriptions, preferring to mark a syllable with '0' whenever they are not completely sure. It has been demonstrated that practice in transcribing (dysarthric) speech can lead to higher assessed intelligibility, at least in relation to the same speaker (Hustad and Cahill 2003). Note, however, that when they listened to a sample twice, D4 and B2 obtained the same result on both occasions. Hence, although certain listener characteristics such as personality and proficiency might be important factors to consider when interpreting specific scores for a child, these effects appear to be stable. This is an indication that the assessment method is appropriate for clinical work and for the evaluation of treatment interventions, provided that the same clinician makes the assessments, based on recordings, and that this clinician has received some training in transcribing for intelligibility assessment.


Figure 2. Intelligibility scores for the individual members of the two groups (SSD and TD).

Comparing the ICC values in the present study with those from other studies is difficult because most previous relevant studies do not specify whether they report single or average measures. Hodge and Gotzke (2007), comparing transcription of spontaneous speech with a master transcript using orthographic transcription, report an ICC of 0.99 for children with cleft palate and 0.86 for children without cleft palate. Another study that included assessment based on orthographic transcription of word lists had an ICC ranging from 0.94 to 0.99 for different listener groups (Zajac et al. 2010). In contrast, the method used in the present study attained ICC scores of between 0.84 and 0.99 using a non-structured speech material (spontaneous speech) without any type of master transcript, circumstances that could be expected to yield greater variation in results than a structured speech material such as word lists.

As regards intra-judge reliability, the current study had r = 0.94, which is satisfactory compared with the earlier studies referred to above. The intelligibility score was higher the second time round for the majority of the speech material. This could be due to an effect of familiarization (Hustad and Cahill 2003), yielding higher results when hearing the utterances a second time. In the study by Zajac et al. (2010), the Pearson correlation coefficients ranged from 0.92 to 0.95 for word lists; and

Gordon-Brannan and Hodson (2000) reported intra-judge reliability values of r = 0.91–1.00 based on ratings of spontaneous speech. Unfortunately, comparisons with the values reported in the test manual of the Weiss Intelligibility Test (Weiss 1982) are not feasible, since the values for the single-word task cannot be separated from those for the spontaneous-speech task.

The present study also investigated the relationship between two different speech samples from the same child and found strong consistency at listener-group level. It is important to note that the speech samples were always presented in the same order, which could give the same effect of familiarization as for intra-judge reliability, with higher scores for the second speech sample. However, almost as many listeners obtained lower as higher scores the second time round. A closer look at individual results shows that the difference between scores ranged from zero to as much as 24 percentage points (mean of the four listeners). However, in as many as 72% of the cases, the difference was less than 10 percentage points. The variation observed between speech samples from the same child is interesting and should be taken into consideration when intelligibility for spontaneous speech is analysed, especially in studies aiming to detect changes in intelligibility before and after treatment. One possible solution to this potential problem might be to supplement results from spontaneous-speech samples with a structured speech material such as single words or sentences. Further, it is important to distinguish between 'real' differences in intelligibility score, i.e. those that reflect a change in the speaker's ability to make herself or himself understood, and differences that simply reflect random variation, due for example to the content of the speech material.

An important advantage of the method used in the present study is that it circumvents the problem of estimating the number of words in unintelligible strings that has been discussed in earlier studies (Flipsen 2006, Shriberg et al. 1997b). A major conclusion made by Flipsen (2006) was that the number of words in unintelligible strings can in fact be estimated in a way that makes it possible to assess intelligibility based on spontaneous speech. Further, Flipsen concluded that—provided that a fixed value for syllables per word is accepted—three of the suggested measures were acceptable: II-Original, where word boundaries are established during transcription; II-1.25, where a syllables-per-word index is used; and II-AN, where age-normalized values are used. II-Original proved to be the most time-efficient measure. It would be of great interest in future studies to analyse the same type of data as in the present study using the four options presented by Flipsen (2006) (see note 2).

The children in the present study with typical speech development had a mean intelligibility score of 97.7% and the children with SSD had a mean of 74.7%. Earlier studies have reported values between 85% and 100% for children with typical speech development and between 64.0% and 89.4% for children with atypical speech (Hodge and Gotzke 2007, Weiss 1982, Gordon-Brannan and Hodson 2000, Flipsen 2006). The methods used are not exactly the same, but comparison remains possible given that all of these studies used spontaneous speech. The fact that the results of the present study are similar to those obtained in earlier studies is one argument in favour of the validity of the method explored here. The correlation between intelligibility scores and PCC also indicates sufficient validity. The correlation between PCC and intelligibility for the group with SSD studied here was similar in size to the values reported by Zajac et al. (2010) (i.e. r ≈ 0.8). The intelligibility assessment method analysed in the present study was also effective in separating the two groups of children, which is commonly seen as an indication of good validity (Hodge and Gotzke 2007, Zajac et al. 2010). To further investigate the ability of the method to identify children with speech disorders, a discriminant function analysis was performed; the results were statistically significant. No child in the TD group was incorrectly classified by the method as a member of the SSD group, but three of the children in the SSD group were incorrectly grouped together with the children in the TD group. Thus, the assessment method does not appear to yield many 'false alarms'. However, some children obtained a relatively

high score on the intelligibility assessment despite having an SSD. In fact, two of these children (SSD-5 and SSD-6) had the two highest PCC scores in the SSD group, as would have been expected, but the third one (SSD-1) had a fairly low PCC score—which could be due to the fact that PCC and intelligibility were calculated using different speech materials (single words and spontaneous speech, respectively). A previous study (Lagerberg et al. 2013) found a difference of as much as 16 percentage points in mean intelligibility score between the single-word task and the spontaneous-speech task, with the higher average score for spontaneous speech; that study, however, analysed only intelligibility, not PCC. The issue of the complex and by no means straightforward links between PCC and intelligibility has long been the subject of discussion, which testifies to the fact that speech intelligibility is affected not just by articulation but also by various factors such as prosody-voice involvement and specific patterns of error types (Shriberg and Kwiatkowski 1982, Kwiatkowski and Shriberg 1992). An analysis of speech-error types did not reveal any obvious pattern common to the three children in question. The errors made by SSD-6 were almost exclusively misarticulations of /r/, which is very common in Swedish children. The other two children, SSD-1 and SSD-5, had more manifestations of phonological processes such as stopping and fronting, which are the most common processes in Swedish-speaking children with phonological disorders (Lohmander et al. 2005). The type of speech error can affect the impact of the speech and/or language disorder at the level of intelligibility, with fronting of velars, for example, having less impact than stopping of fricatives and affricates when the speech material is adjusted to reflect the frequency of these distortions in spontaneous speech (Klein and Flint 2006). However, this finding refers to English and it cannot be assumed that the specifics of this differentiated impact are the same for all languages. To our knowledge, no corresponding study has been carried out for Swedish.

All three children with unexpectedly high intelligibility were found to be talkative: they yielded the longest utterances among the children with SSD in the edited speech material. The correlation between average utterance length and intelligibility score was found to be statistically significant for the entire SSD group. The reason for this observed association between utterance length (or size of speech material) and intelligibility is not entirely clear. One possibility is that the sheer size of the speech material brings more context to individual words, which might be useful for listeners performing the transcription task. Such a hypothesis is in line with theories of speech perception stressing the importance of top-down—or meaning-based—processes in deciphering speech (Hustad 2007), and the existence of such effects has been confirmed in studies of dysarthric speakers, where narratives yielded higher intelligibility scores than isolated sentences and words (Hustad 2007), as well as in studies of listener strategies for understanding dysarthric speech (Hustad et al. 2011).

Another—not mutually exclusive—possibility is that more talkative children include additional features in their speech. For instance, the child with relatively high intelligibility but low PCC (SSD-1) evidently enjoyed talking, and was engaged and vivid in his expression. Perhaps such communicative vividness adds prosodic and linguistic cues to the speech signal that make him easier to understand despite his SSD. Further studies with larger samples are needed to explore these hypotheses and features in depth. The fact that not all children diagnosed with SSD were detected by the method implies that it should not be used as the only referral instrument, but rather together with other instruments, such as speech and language assessment combined with a method that captures contextual factors, such as the Intelligibility in Context Scale (McLeod et al. 2012). One limitation of the present study is the small size of the sample.

The clinical implications are that the method investigated in the present study can be used in research that uses results from several listeners and also as a clinical tool to evaluate the development of a child's intelligibility if the same clinician makes all the assessments. Further, the method is relatively easy to use since there is no need to create a master transcript. If the same treating clinician makes subsequent assessments of the same patient, the effect of familiarization must be taken into account. Ideally, an independent judge would be preferred. However, in spite of these limitations, we believe that the described method is an important addition to existing methods for characterizing children's intelligibility in spontaneous speech, particularly through its relatively low reliance on subjective judgements. Although this method does not give any guidance as to which articulatory processes need to be trained to improve intelligibility, it can be useful for the assessment of a child's speech beyond an assessment of articulation, in relation to the impact of the speech disorder on the child's functional verbal communication.

As regards the practical implications of test scores, Gordon-Brannan and Hodson (2000) recommended that a 4-year-old with a score lower than 66% should be subject to intervention, and stated that 90% could be set as the limit at which intervention is no longer needed, based on transcription of continuous speech. When applying such cut-offs, however, it is crucial to specify and consider the type of speech material used (Lagerberg et al. 2013). Cut-offs should probably be set at a higher level for spontaneous speech than for single words. Whether additional diagnostic information can be obtained by

including a PCC cut-off as well would be an interesting question for future research.

The method proposed in the present study is objective, at least compared with ratings by caregivers or others, and it could be a useful tool to determine whether speech is a possible means for a child to convey her or his message at all. If the score is too low—below 60% has been suggested as the limit at which speech is perceived as unintelligible (Monsen 1981)—augmentative and alternative communication, such as compensatory strategies, might be needed for an individual to be able to use speech to communicate her or his message effectively. Finally, at least judging from the small sample studied, utterance length seems to be a variable that needs to be kept as equal as possible across samples when spontaneous speech is edited for use in intelligibility assessment.

Further studies should be carried out on larger samples and should investigate results comparing word lists and spontaneous speech. Moreover, exploring the impact of speech-material characteristics such as utterance length and content of the spontaneous-speech task could add to our understanding of intelligibility assessment (and indeed of the concept of 'intelligibility'). It would also be interesting to investigate the impact which the number of repetitions exerts on listeners as regards reliability, validity and level of intelligibility.

Conclusion

Taken together, the results of the present study show that the method for assessing intelligibility on the basis of the percentage of syllables perceived as understood has high validity and reliability for research purposes if the mean across several listeners is used. This method could be considered as an alternative or a supplement to the use of rating scales in research that involves intelligibility assessment. In addition, the proposed method is easier to perform than methods involving the making of a master transcript with which to compare listeners' transcriptions, meaning that it facilitates the use of spontaneous speech in transcription-based intelligibility assessment. The method could also be used in clinical work if the same listener makes both the initial and the follow-up assessments, for instance when evaluating an intervention aimed at improving speech.

Acknowledgements

The authors wish to thank the speech–language pathologists Anna-Karin Ahlman, MSc, and Andrea Börjesson, MSc, for valuable help with data collection; and research engineer Jonas Lindh, who created the software used in the study.

Declaration of interest: This study was supported in part by Grant Number 2010-2131 from the Swedish Research Council and by the Petter Silfverskiöld Memorial Fund. The responsibility for the content of this article lies with the authors alone. The authors report no declarations of interest.

Notes

1. The output of ICC is of two types: single measures and average measures. Single measures should be reported when an assessment method is intended for use by a single judge, e.g. in clinical work, whereas average measures are more appropriate when an assessment method is designed to be used in research, where the mean score of several judges is frequently used (Shrout and Fleiss 1979). Hence, the single ICC measure is indicative of the reliability of the scale when a sample is judged by a single listener, whereas the average ICC measure is indicative of the reliability of the scale when scores represent the average of different listeners' judgements.

2. In principle, the results in terms of intelligibility scores obtained in the present study can be compared with results from studies where the intelligibility score is based on word count by using a syllables-per-word (SPW) index of 1.25 (Shriberg et al. 1997b). This is because the transformation of the score would involve dividing both the numerator and the denominator by the SPW index, which would not change the ratio. It is uncertain whether 1.25 is a valid estimate of the mean number of syllables per word in the Swedish language.
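A minimal arithmetic check of the invariance argument in note 2 (the 80/100 figures are arbitrary):

```python
words_understood, words_total = 80, 100
spw = 1.25  # assumed mean syllables per word (Shriberg et al. 1997b)

word_based = 100 * words_understood / words_total
# Scaling numerator and denominator by the same SPW constant leaves the
# percentage unchanged, so word- and syllable-based scores are comparable.
syllable_based = 100 * (words_understood * spw) / (words_total * spw)
assert word_based == syllable_based == 80.0
```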

References

FLEISS, J. L., 1986, The Design and Analysis of Clinical Experiments (New York, NY: Wiley).
FLIPSEN, P., 2006, Measuring the intelligibility of conversational speech in children. Clinical Linguistics and Phonetics, 20, 303–312.
GORDON-BRANNAN, M. and HODSON, B. W., 2000, Intelligibility/severity measurements of prekindergarten children's speech. American Journal of Speech–Language Pathology, 9, 141–150.
HODGE, M. and GOTZKE, C. L., 2007, Preliminary results of an intelligibility measure for English-speaking children with cleft palate. Cleft Palate–Craniofacial Journal, 44, 163–174.
HUSTAD, K. C., 2007, Effects of speech stimuli and dysarthria severity on intelligibility scores and listener confidence ratings for speakers with cerebral palsy. Folia Phoniatrica et Logopaedica, 59, 306–317.
HUSTAD, K. C. and CAHILL, M. A., 2003, Effects of presentation mode and repeated familiarization on intelligibility of dysarthric speech. American Journal of Speech–Language Pathology, 12, 198–208.
HUSTAD, K. C., DARDIS, C. M. and KRAMPER, A. J., 2011, Use of listening strategies for the speech of individuals with dysarthria and cerebral palsy. Augmentative and Alternative Communication, 27, 5–15.
KENT, R. D., 1992, Intelligibility in Speech Disorders: Theory, Measurement and Management (Amsterdam: John Benjamins).
KENT, R. D., MIOLO, G. and BLOEDEL, S., 1994, The intelligibility of children's speech: a review of evaluation procedures. American Journal of Speech–Language Pathology, 3, 81–95.
KENT, R. D., WEISMER, G., KENT, J. F. and ROSENBEK, J. C., 1989, Toward phonetic intelligibility testing in dysarthria. Journal of Speech and Hearing Disorders, 54, 482–499.
KLEIN, E. S. and FLINT, C. B., 2006, Measurement of intelligibility in disordered speech. Language, Speech and Hearing Services in Schools, 37, 191–199.
KWIATKOWSKI, J. and SHRIBERG, L. D., 1992, Intelligibility assessment in developmental phonological disorders: accuracy of caregiver gloss. Journal of Speech and Hearing Research, 35, 1095–1104.
LAGERBERG, T. B., LOHMANDER, A. and PERSSON, C., forthcoming 2013, Assessing intelligibility by single words, sentences and spontaneous speech: a methodological study of the speech production of 10-year-olds. Logopedics Phoniatrics Vocology. E-pub ahead of print.
LILLVIK, M., ALLEMARK, E., KARLSTRÖM, P. and HARTELIUS, L., 1999, Intelligibility of dysarthric speech in words and sentences: development of a computerised assessment procedure in Swedish. Logopedics Phoniatrics Vocology, 24, 107–119.
LOHMANDER, A., BORELL, E., HENNINGSSON, G., HAVSTAM, C., LUNDEBORG, I. and PERSSON, C., 2005, SVANTE: svenskt artikulations- och nasalitets-test (Skivarp: Pedagogisk Design).
MCLEOD, S., HARRISON, L. J. and MCCORMACK, J., 2012, Intelligibility in Context Scale: validity and reliability of a subjective rating measure. Journal of Speech, Language and Hearing Research, 55, 648–656.
MONSEN, R. B., 1981, A usable test for the speech intelligibility of deaf talkers. American Annals of the Deaf, 126, 845–852.
MORRIS, S. R. and WILCOX, K. A., 1999, The Children's Speech Intelligibility Measure (San Antonio, TX: Psychological Corporation).
SCHIAVETTI, N., 1992, Scaling procedures for the measurement of speech intelligibility. In R. D. Kent (ed.), Intelligibility in Speech Disorders (Amsterdam: John Benjamins), pp. 11–34.
SHRIBERG, L. D., AUSTIN, D., LEWIS, B. A., MCSWEENY, J. L. and WILSON, D. L., 1997a, The percentage of consonants correct (PCC) metric: extensions and reliability data. Journal of Speech, Language and Hearing Research, 40, 708–722.
SHRIBERG, L. D., AUSTIN, D., LEWIS, B. A., MCSWEENY, J. L. and WILSON, D. L., 1997b, The Speech Disorders Classification System (SDCS): extensions and lifespan reference data. Journal of Speech, Language and Hearing Research, 40, 723–740.
SHRIBERG, L. D. and KWIATKOWSKI, J., 1982, Phonological disorders III: a procedure for assessing severity of involvement. Journal of Speech and Hearing Disorders, 47, 256–270.
SHROUT, P. E. and FLEISS, J. L., 1979, Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
STREINER, D. L. and NORMAN, G. R., 2008, Health Measurement Scales: A Practical Guide to their Development and Use, 4th edn (Oxford: Oxford University Press).
WEISMER, G., JENG, J. Y., LAURES, J. S., KENT, R. D. and KENT, J. F., 2001, Acoustic and intelligibility characteristics of sentence production in neurogenic speech disorders. Folia Phoniatrica et Logopaedica, 53, 1–18.
WEISS, C. E., 1982, Weiss Intelligibility Test (Tigard, OR: CC Publ.).
WHITEHILL, T. L., 2002, Assessing intelligibility in speakers with cleft palate: a critical review of the literature. Cleft Palate–Craniofacial Journal, 39, 50–58.
YORKSTON, K. M. and BEUKELMAN, D. R., 1978, A comparison of techniques for measuring intelligibility of dysarthric speech. Journal of Communication Disorders, 11, 499–512.
YORKSTON, K. M. and BEUKELMAN, D. R., 1981, Assessment of Intelligibility of Dysarthric Speech (Tigard, OR: CC Publ.).
YORKSTON, K. M., STRAND, E. A. and KENNEDY, M. R. T., 1996, Comprehensibility of dysarthric speech: implications for assessment and treatment planning. American Journal of Speech–Language Pathology, 5, 55–66.
ZAJAC, D., PLANTE, C., LLOYD, A. and HALEY, K., 2010, Reliability and validity of a computer-mediated single-word intelligibility test: preliminary findings for children with repaired cleft lip and palate. Cleft Palate–Craniofacial Journal, 48, 538–549.
