INT J LANG COMMUN DISORD, JULY 2015, VOL. 50, NO. 4, 476–487

Research Report

Effect of the number of presentations on listener transcriptions and reliability in the assessment of speech intelligibility in children

Tove B. Lagerberg, Jakob Åsberg Johnels, Lena Hartelius and Christina Persson
The Sahlgrenska Academy, University of Gothenburg, Institute of Neuroscience and Physiology, Division of Speech and Language Pathology, Gothenburg, Sweden

(Received February 2014; accepted November 2014)

Abstract

Background: The assessment of intelligibility is an essential part of establishing the severity of a speech disorder. The intelligibility of a speaker is affected by a number of different variables relating, inter alia, to the speech material, the listener and the listener task.

Aims: To explore the impact of the number of presentations of the utterances on assessments of intelligibility based on orthographic transcription of spontaneous speech, specifically the impact on intelligibility scores, reliability and intra-listener variability.

Methods & Procedures: Speech from 12 children (aged 4:6–8:3 years; mean = 5:10 years) with percentage of consonants correct (PCC) scores ranging from 49 to 81 was listened to by 18 students on the speech–language pathology (SLP) programme and by two recent graduates from that programme. Three conditions were examined during the transcription phase: (1) listening to each utterance once; (2) listening to each utterance a second time; and (3) listening to all utterances from a given child a third time after having heard all of that child’s utterances twice.

Outcomes & Results: Statistically significant differences between intelligibility scores were found across the three conditions: the intelligibility score increased with the number of presentations, while inter-judge reliability was unchanged. The results differed markedly across listeners, but each individual listener’s results were very consistent across conditions.

Conclusions & Implications: Information about the number of times an utterance is presented to the listener is important and should therefore always be included in reports of research involving intelligibility assessment. There is a need for further research and discussion on listener abilities and strategies.

Keywords: speech intelligibility, number of presentations, listening conditions, assessment of intelligibility.

What this paper adds?

What is already known on the subject?
‘Intelligibility’ can be defined as a speaker’s ability to convey a message to a listener by means of the acoustic signal. Intelligibility assessment gives an overall picture of the sum of all factors involved in speech production and is therefore essential both in the exploration of the consequences of a speech disorder and in the evaluation of treatment outcome. It is well known that intelligibility assessment results are affected by many different variables such as type of speech material, listener task and listener characteristics.

What this paper adds?
This paper investigates a variable that may reasonably be hypothesized to affect intelligibility assessment results but has not previously been systematically investigated, namely the number of times the utterance is presented to the listeners. Intelligibility scores were calculated on the basis of the orthographic transcription of spontaneous speech,

Address correspondence to: Tove B. Lagerberg, The Sahlgrenska Academy at the University of Gothenburg, Institute of Neuroscience and Physiology, Division of Speech and Language Pathology, Box 452, SE-405 30 Gothenburg, Sweden; e-mail: [email protected]

International Journal of Language & Communication Disorders
© 2015 Royal College of Speech and Language Therapists
ISSN 1368-2822 print/ISSN 1460-6984 online DOI: 10.1111/1460-6984.12149

Effect of number of presentations on intelligibility

477

a commonly used method, and the results showed that there was a statistically significant difference depending on whether the material was presented once, twice or three times to listeners but that reliability was comparable for these three conditions. This implies, first, that it is crucial to report the number of presentations in studies including intelligibility assessment and, second, that increased reliability is probably not an argument in favour of using the more time-consuming and effortful practice of presenting the utterances repeatedly to the listeners. The paper also includes a discussion of the considerable variability in intelligibility scores found across individual listeners.

Introduction

The concept of ‘intelligibility’ is complex and involves a great many aspects in addition to the speech signal, such as the listener’s characteristics, the type of speech material, the listener’s task and the listening condition; all these aspects affect the outcome of an intelligibility assessment (Kent 1992, 1996, McHenry 2011, Hustad and Cahill 2003, Johannisson et al. 2014). As stated by Kent et al. (1994), ‘the idea that a single intelligibility score can be ascribed to a given individual apart from listener and listening situation is somewhat a fiction’ (p. 81). Although this must be borne in mind when speech intelligibility is studied, assessing and describing a person’s level of intelligibility is still worthwhile, for instance to evaluate the effect of an intervention aimed at enhancing speech-mediated communication.

There are many different approaches to the assessment of intelligibility. One common approach is to use lists of words or sentences that the speaker reads out or repeats after a model, whereupon listeners are asked to transcribe the speech material or answer multiple-choice questions about it (Hodge and Gotzke 2007, Zajac et al. 2006, Yorkston and Beukelman 1981, Morris and Wilcox 1999). The advantage of using such a structured speech material is, of course, that as the target words are known, it is relatively easy to calculate the percentage of the words that the listener perceives correctly. One disadvantage is that this approach may not yield a very accurate picture of the speaker’s ability to use speech for communication purposes in daily life (Kwiatkowski and Shriberg 1992, Morris et al. 1995). Transcribing spontaneous speech rather than lists of words or sentences is generally agreed to be a highly reliable method that gives a more ecologically valid picture of the speaker’s communicative ability (Kwiatkowski and Shriberg 1992, Whitehill 2002).
There are two reasons for this: spontaneous speech gives a more realistic view of the speaker’s ability to produce speech, and it includes features that are normally available to the listener, such as prosodic and contextual ones.

As regards listener tasks, there are various ones besides transcription that can be used to assess intelligibility in spontaneous speech. One is rating scales, where a number or a label is assigned to the degree of intelligibility on an equal-appearing scale (e.g. one ranging from 0 = ‘always

unintelligible’ to 9 = ‘always intelligible’). Another is direct magnitude estimation, where the listener makes a numerical estimate of the level of intelligibility relative to the perceived level of other speech samples presented to the listener. It has been argued that, for the purpose of intelligibility assessment, rating scales are less valid and reliable than transcription (Schiavetti 1992, Whitehill 2002). Tasks involving direct magnitude estimation do not have this problem of validity and reliability (Whitehill 2002), but they are dependent on the speech sample chosen as standard when a modulus has been assigned to that standard (Weismer and Laures 2002). This is a possible source of error which is avoided with a transcription task.

In recent years, the focus of intelligibility research has increasingly shifted from the speaker to the listener (Hustad et al. 2011). Issues such as variation between listeners in intelligibility scores and the influence of listener characteristics have been discussed and investigated in several studies (Pennington and Miller 2007, McHenry 2011). By and large, naïve listeners are considered to generate lower scores and to show more variability among themselves than experienced listeners (Pennington and Miller 2007), even though Keuning et al. (1999), assessing the intelligibility of cleft-palate speech, found no difference in intra-judge reliability between judges with and without expertise. Further, age, sex and familiarity with the speaker’s accent seem to have no substantial effect on intelligibility scores, at least not for dysarthric speakers (Pennington and Miller 2007).
Another relevant factor is the extent to which the listener has access to contextual information—especially for speech with low intelligibility, since this may make the listener rely more on top-down processes such as making inferences on the basis of knowledge of the content rather than on bottom-up processes involving the interpretation of the actual speech signal (Hustad and Beukelman 2001, Kent 1996, Lindblom 1990, Garcia and Dagenais 1998). It is generally considered that longer speech stimuli (e.g. sentences) yield higher intelligibility scores than shorter ones (e.g. single words); this is probably because longer stimuli include more contextual cues. This could be of importance, especially if the speech signal is impaired (Hustad and Beukelman 2001, Lindblom 1990). However, findings from earlier studies comparing speech materials consisting of

narratives with ones consisting of unrelated sentences are contradictory (Hustad and Beukelman 2001).

Studies of listener familiarization in relation to dysarthric speech have led to the conclusion that listeners tend to attain higher intelligibility scores when they listen several times to the same passage read by different speakers (and the tendency is even stronger if the familiarization phase is explicit, i.e. when listeners are instructed to read a transcript of the passage while listening to it during the familiarization phase) (Borrie et al. 2012). In another study, Hustad and Cahill (2003) let listeners hear the same speaker uttering different sentences and found that intelligibility scores increased for each trial, with a statistically significant improvement between the first and fourth trials. In line with this, a further aspect of familiarization that may reasonably be hypothesized to affect intelligibility scores is the number of times that the same speech material is presented to the listener.

A review of some of the studies published in the past two decades concerning either the concept of intelligibility as such or the methods used to assess it shows that the number of presentations used varies across the research reported (table 1). There are relatively large methodological differences between these studies, and also differences in purpose. The speakers are sometimes children (Gordon-Brannan and Hodson 2000, Hodge and Gotzke 2007, 2011, Hustad et al. 2012, Keuning et al. 1999, Morris et al. 1995, Whitehill and Chau 2004, Wilson and Spaulding 2010, Zajac et al. 2010) and sometimes adults (Borrie et al. 2012, Haley et al. 2011, Hustad 2006, Hustad and Beukelman 2001, Hustad and Cahill 2003, Lillvik et al. 1999, McAuliffe et al. 2010, Sussman and Tjaden 2012, Whitehill and Chau 2004), with dysarthria due to cerebral palsy (Hustad and Beukelman 2001, Hustad and Cahill 2003, Hustad 2006, Hustad et al.
2012) or neurological diseases (Borrie et al. 2012, Lillvik et al. 1999, Sussman and Tjaden 2012). In other studies, the speakers have a speech–sound disorder (Hodge and Gotzke 2011), or the focus is on evaluation of an intervention relating to cleft palate (Keuning et al. 1999). The number of participants varies between 4 and 78, and there is also variation in the characteristics of listeners and the number of listeners. In this limited, and by no means complete, review of earlier studies, we were not able to see any pattern in the number of presentations of speech material chosen that could be linked to any of these variables.

It seems logical to expect that intelligibility scores will be affected by the number of presentations in the same way as that found for other aspects of familiarization: intelligibility can be expected to increase with the number of presentations, such that the listener will understand an utterance better when hearing it for the second or third time. However, to our knowledge this has not yet been investigated systematically.

In other words, there is a lack of studies of the effect that the number of presentations may exert on the level of intelligibility scores. This issue is of interest when results from different studies (that may or may not have used different numbers of presentations) are compared, and it should be taken into account in the design of studies involving intelligibility assessment.

Aim

The aim of the present study was to investigate the impact of the number of presentations of the utterances on listener transcription in the context of intelligibility assessment. Specifically, the aim was to investigate the extent to which intelligibility scores, reliability and listener variability depend on the number of presentations of the same utterance from the same speaker.

Method

The design of the study included three (sequentially administered) different listener conditions. The listeners first heard an utterance once and transcribed it orthographically (C1). They then heard the same utterance a second time and were told to modify their first transcription if they so desired (C2). Finally, when they had listened to all of the utterances produced by one child in this manner, they listened to the same utterances for a third time, one at a time and in the same order, and were again asked to modify their transcriptions if they felt this to be appropriate (C3).

Participants

The transcription data used in the present study represent a subset of those used in a parallel study (Lagerberg et al. 2014). Only the utterances produced by twelve out of the twenty participants in that study were used. This was because, to avoid a ceiling effect in terms of intelligibility scores, only children with a percentage of consonants correct (PCC) score of less than 90% were included in the present study.
It is true that PCC scores offer no qualitative information, for example as regards places in word structure that cause difficulty, and thus offer no means of evaluating the structural constraints on a child’s phonological system. However, PCC was still deemed to be a suitable measure for singling out children who might be expected not to have problems with intelligibility. PCC scores had been calculated in accordance with Shriberg and Kwiatkowski (1982), except that single words (Swedish Articulation and Nasality Test; Lohmander et al. 2005) were used instead of spontaneous speech (cf., for example, McLeod et al. 2012 and Zajac et al. 2010 for a similar approach). This criterion yielded a study group of twelve children (age range 4:6–8:3 years; mean = 5:10 years, six males


Table 1. Number of presentations of speech material used in some studies involving intelligibility: an overview

Study | Speech material | Number of presentations | Listener task
Morris et al. (1995) | Single words | Not stated | Closed set
Lillvik et al. (1999) | Single words and nonsensical sentences read out loud | 2 | Closed set and orthographic transcription
Keuning et al. (1999) | Sentences read out loud and repeated after a model | ‘As often as desired’ | Visual analogue scale
Gordon-Brannan and Hodson (2000) | Single words and sentences repeated after a model and spontaneous speech | 3 (maximum) | Closed set (words), orthographic transcription (sentences and spontaneous speech)
Hustad and Beukelman (2001) | Structured sentences repeated after a model | 2 (first time only listening without transcribing) | Orthographic transcription
Hustad and Cahill (2003) | Sentences repeated after a model | 2 (first time only listening without transcribing) | Orthographic transcription
Whitehill and Chau (2004) | Single words read out loud | 1–2 (as per listener’s choice) | Closed set
Hustad (2006) | Sentences (‘produced’) | 2 | Orthographic transcription
Hodge and Gotzke (2007) | Single words repeated after a model and spontaneous speech | 1 (single words); 1–2 as per the listener’s choice (spontaneous speech) | Closed set (single words) and orthographic transcription (single words and spontaneous speech)
McAuliffe et al. (2010) | Spontaneous speech (30 s) | 2 | Direct magnitude estimation
Wilson and Spaulding (2010) | Sentences read out loud | 1 | Orthographic transcription
Zajac et al. (2010) | Single words repeated after a model | 1 | Orthographic transcription
Haley et al. (2011) | Single words repeated after a model | 1 | Orthographic transcription and closed set
Hodge and Gotzke (2011) | Single words and spontaneous speech | 1 (single words); 1–2 (spontaneous speech) | Closed set (words), orthographic transcription (spontaneous speech)
Borrie et al. (2012)a | Sentences read out loud (nonsensical and semantically plausible ones) | 1 (not explicitly stated) | Orthographic transcription
Hustad et al. (2012) | Single words and sentences repeated after a model | 1 | Orthographic transcription and rating
Sussman and Tjaden (2012) | Single words, sentences and narrative read out loud | 1 (not explicitly stated) | Closed set (single words), orthographic transcription (sentences), visual analogue scale (narrative)

Note: a This study used the term ‘percentage of words correct’ (PWC), not ‘intelligibility’.

and six females) with PCC < 90%, who were labelled S1–S12 (table 2). These children had been recruited through speech–language pathologists (SLPs) working at the Department of Paediatric Speech and Language Pathology at the Queen Silvia Children’s Hospital in Gothenburg, Sweden, as well as through contacts with schools and pre-schools in the Gothenburg area. Ten of the twelve children had a speech–sound disorder of unknown origin. The other two had been recruited as members of the control group in the above-mentioned parallel study (Lagerberg et al. 2014); since their PCC scores were below the cut-off point of 90% used for the present study, they were included in it (table 2). All participants had normal hearing and Swedish as their strongest language, as reported by their parents. All parents gave their written consent for their children to participate in the study, and ethical approval was obtained from the Regional Ethical Review Board of Gothenburg. The listeners were 18 students enrolled in the SLP study programme of the University of Gothenburg

and two recent graduates from that programme. The students were spread out across years 1–4 of the programme, which probably means that they differed somewhat in their knowledge about phonetics and their experience of making orthographic transcriptions. This may have had an effect on their transcriptions. The listeners were all females and between 20 and 35 years old. All of them had normal hearing and Swedish as their strongest language, according to self-reports. The listeners were labelled L1–L20. Speech samples Spontaneous speech was elicited through conversation about anything that a given child wished to talk about, such as toys that were present in the room or things the child had done at school or on holiday. The recordings were made by two SLP students (not subsequently serving as listeners) using a TASCAM HD-P2 Portable High-Definition Stereo Audio Recorder with an


Table 2. Speaker and speech-material characteristics

Speaker | Age (years:months) | Sex | Diagnosis | PCC score | Mean number of words in the sample | Mean number of words per utterance
S1 | 4:6 | Male | SSD | 59 | 112 | 7.40
S2 | 5:6 | Female | SSD | 60 | 94 | 6.10
S3 | 6:7 | Male | SSD | 54 | 50 | 1.70
S4 | 5:11 | Female | SSD | 61 | 102 | 5.70
S5 | 5:11 | Male | SSD | 70 | 97 | 9.10
S6 | 8:3 | Male | SSD | 81 | 109 | 7.90
S7 | 6:6 | Female | SSD | 65 | 100 | 5.90
S8 | 6:6 | Female | SSD | 49 | 102 | 5.20
S9 | 5:2 | Male | SSD | 61 | 90 | 6.80
S10 | 5:4 | Female | SSD | 61 | 114 | 6.20
S11 | 4:8 | Male | TD | 79 | 52 | 5.10
S12 | 5:0 | Female | TD | 72 | 112 | 4.50
Mean/ratio | 5:10 (SD = 13 months) | 6:6 | | 63 (SD = 9) | 94 (SD = 22) | 6.00 (SD = 1.9)

Note: SSD, speech–sound disorder; TD, no known speech or language disorder; PCC, percentage of consonants correct; SD, standard deviation.

external SONY ECM-MS957 microphone placed on a table about 30 cm from the child. The children’s speech was recorded either in a quiet room at their school or pre-school, at the SLP clinic or at home, in part according to each family’s wishes.

For each child, a speech sample was prepared, made up of utterances that included a total of approximately 100 words. The speech material consisted of consecutive utterances from the child’s spontaneous speech (the speech of the SLP student was removed), each edited to a length of 1–15 words using Audacity 1.3 Beta (Unicode). The editing was carried out manually, and the first author decided where to cut the sound files; as far as possible, cuts were made in accordance with semantically natural pauses so as not to interfere with content. Since these were the only features controlled for, this resulted in utterances of different length and intonation, which did not always constitute full sentences. Utterances consisting of only ‘yes’ or ‘no’ were not included. In five cases, the original recording contained fewer than 100 words (table 2).

The sound files were transferred to software created especially for the study. For the purpose of automatic recording, playing and correction of the transcriptions, a set of Praat scripts was created (Boersma 2001). These scripts were placed in a ‘plug-in structure’ to be easily distributed across different computers throughout the testing phase. The scripts made it possible for the transcribers to listen to the recordings and simultaneously transcribe what they perceived. One script calculated the number of syllables not transcribed. The result was then automatically exported to a spreadsheet, which could easily be imported to statistical software for further analyses.
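The post-processing step described above (counting untranscribed syllables per utterance and exporting the result in spreadsheet form) can be sketched as follows. This is a hypothetical Python analogue, not the study's actual Praat scripts; the utterance IDs and transcriptions are made up for illustration.

```python
import csv
import io

def untranscribed_syllables(transcription: str) -> int:
    """Each '0' in an orthographic transcription marks one syllable
    the listener could not understand."""
    return transcription.count("0")

def export_counts(utterances, out) -> None:
    """Write per-utterance counts as CSV, ready for import into
    statistical software."""
    writer = csv.writer(out)
    writer.writerow(["utterance", "untranscribed_syllables"])
    for utt_id, text in utterances:
        writer.writerow([utt_id, untranscribed_syllables(text)])

# Hypothetical transcriptions: four understood syllables and two '0's,
# then a fully unintelligible utterance of three syllables.
buf = io.StringIO()
export_counts([("utt01", "jag vill leka 0 0"), ("utt02", "0 0 0")], buf)
```

In the actual study the counts were written directly to a spreadsheet by a Praat script; the CSV form above is one simple equivalent.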

Assessment of intelligibility

As mentioned above under ‘Method’, three different conditions (labelled C1, C2 and C3) were used for the assessment of intelligibility. The listeners first heard an utterance once and transcribed it (C1). Immediately afterwards they listened to the same utterance a second time and were told to modify their transcription if they felt this was appropriate (C2). When they had listened to all of the utterances produced by one child in this manner, they listened to those utterances for a third time, one at a time and in the same order, and were again asked to modify their transcriptions as appropriate (C3). When transcribing after the third presentation, listeners thus had more contextual information about the topics occurring in the utterances produced by the child in question. This represented an attempt to find out whether context (in a rather broad meaning) affected the ratings.

Since it would have been too time-consuming for each listener to transcribe all speech samples, each speech material was assigned to four listeners. In the original study (Lagerberg et al. 2014), where twenty children were included, each listener transcribed speech material from four children. For the purposes of the present study, however, since transcription data relating to the eight children whose PCC scores exceeded 90% were not included, the number of transcriptions presented here per listener varies between two and four. Hence the listener burden was the same for all listeners; however, since each listener was assigned only a small sample of speech material, generalization across the group should be made with caution.

Before performing the work to be used in the studies, each listener practised by transcribing five examples which she was able to discuss with the test

leader—who also remained present to answer questions throughout the listening session. The listeners made their transcriptions and listened at their own pace, selecting a dialogue box when they were ready to listen to the next presentation or a new utterance. The sound files were played to the listeners using software specially developed for the study, and the listeners also made their transcriptions using that software. The order of the utterances was not counterbalanced but was always the same. The reason for this choice was to make the situation similar to a real one, where the listener would hear the utterances in the order in which they were spoken.

The listeners used headphones (Sennheiser HD 212 Pro) and could adjust the volume as they wished. They were instructed to transcribe orthographically all words that they understood and to mark each syllable with ‘0’ when they did not understand. Word boundaries were to be marked with a space. Guessing was discouraged, because it had been found in a previous study that guesses reduced the reliability of this particular method to assess intelligibility (Johannisson et al. 2014). Further, the listeners were told to transcribe any onomatopoetic words and any revisions of phrases and words, but to leave out any repetition of syllables or phonemes and any filler words such as ‘eh’.

The software calculated the number of syllables transcribed (by counting vowels) and the total number of syllables (by counting vowels and ‘0’s). The use of syllables when calculating intelligibility scores makes it possible to circumvent the problem of estimating the number of words in unintelligible strings, as discussed for example by Flipsen (2006).
The intelligibility score for each participant was calculated as follows:

Intelligibility = (total number of syllables in transcribed words / total number of syllables) × 100

Analytical strategies and statistics

A number of questions and hypotheses were addressed. First, it was examined whether intelligibility scores generally increased as a function of the number of presentations of the utterances. In keeping with earlier studies (Hustad and Cahill 2003, Hustad and Beukelman 2001), it was expected that there would be such an increase due to familiarization (and to potential contextual support, given that in the third condition, the listeners had already heard the full speech sample). For each child, the mean intelligibility score calculated by all listeners was used to investigate whether there was any statistically significant difference in terms of intelligibility scores between the three listening conditions. Repeated-measure analyses of variance (ANOVAs) and a paired-samples t-test were used. The level of significance was set at p < 0.01 to control for the risk of type-1 errors.
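A minimal sketch of this scoring rule, assuming (as in the vowel-counting approach described above) that vowel letters approximate syllable nuclei. This is an illustration, not the study's actual software; the Swedish vowel set is an assumption made here for the example.

```python
# Assumed set of Swedish vowel letters used as syllable-nucleus proxies.
SWEDISH_VOWELS = set("aeiouyåäö")

def intelligibility_score(transcription: str) -> float:
    """Syllable-based intelligibility: transcribed syllables divided by
    all syllables, times 100. Vowels approximate syllables in the
    transcribed words; each '0' marks one syllable the listener did
    not understand."""
    transcribed = sum(ch in SWEDISH_VOWELS for ch in transcription.lower())
    untranscribed = transcription.count("0")
    total = transcribed + untranscribed
    return 100.0 * transcribed / total if total else 0.0
```

For example, a transcription containing four understood syllables followed by two ‘0’ markers scores 100 × 4/6 ≈ 66.7.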

Table 3. Mean intelligibility scores, standard deviations, and the lowest and highest scores under the three conditions (N = 48)

Condition | Mean score | SD | Minimum–maximum
C1 | 75.7 | 18.2 | 29–97
C2 | 78.3 | 18.4 | 29–99
C3 | 81.1 | 18.4 | 29–99

Second, it was examined whether reliability differed across the three conditions. A particularly high (or low) level of inter-judge reliability for any one of the conditions would be an important issue to consider both in future research and in clinical work. Inter-judge reliability was investigated using intra-class correlation (ICC). The output from ICC is of two types: single measures and average measures. Single measures should be reported when an assessment method is intended for use by a single judge, i.e. in clinical work. By contrast, if an assessment method is designed to be used in research, where the mean score of several judges is frequently used, it is more appropriate to report average measures (Shrout and Fleiss 1979). The aim of the present study being to investigate the implications of using different listener conditions in both clinical and research contexts, both single and average measures are reported. Again, the mean intelligibility score obtained from all listeners was used.

Third, listener variability was examined along with the relative performance of individual listeners across the three conditions. For this purpose, visual inspection of graphical representations of data was used.

Results

Intelligibility score

The mean intelligibility score from all listeners to a given speech material increased by approximately three percentage points each time the listeners transcribed the same material; the variation was very similar for all three conditions (table 3). There was a statistically significant difference among the three listener conditions according to a within-group ANOVA (F = 41.34, p < 0.001). Paired-sample t-tests performed as a follow-up showed that the mean score increased with each presentation: C1–C2: t(47) = 4.55, p < 0.001; C1–C3: t(47) = 7.76, p < 0.001; C2–C3: t(47) = 5.58, p < 0.001. The effect size, expressed in r², was 0.68 for C1–C2, 0.84 for C1–C3 and 0.61 for C2–C3.
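For illustration, the paired-samples t statistic, the r² effect size, and the Spearman–Brown relation that links single- and average-measures ICC can be sketched in plain Python. The scores below are made up; the study's analyses were presumably run in standard statistical software.

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-samples t statistic and its degrees of freedom."""
    d = [b - a for a, b in zip(x, y)]
    t = mean(d) / (stdev(d) / math.sqrt(len(d)))
    return t, len(d) - 1

def r_squared(t, df):
    """Effect size r^2 = t^2 / (t^2 + df)."""
    return t * t / (t * t + df)

def average_measures_icc(single_icc, k):
    """Spearman-Brown step-up: average-measures ICC for the mean of
    k judges, given the single-measures ICC (under standard ICC
    assumptions)."""
    return k * single_icc / (1 + (k - 1) * single_icc)

# Made-up intelligibility scores for six speech samples under C1 and C2,
# rising by roughly three percentage points per presentation:
c1 = [70.0, 75.0, 80.0, 65.0, 72.0, 78.0]
c2 = [73.0, 77.5, 83.5, 67.0, 76.0, 81.0]
t, df = paired_t(c1, c2)
effect = r_squared(t, df)
```

The step-up formula shows why an average-measures ICC over four listeners is markedly higher than the corresponding single-measures value.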
Table 3 shows mean results for each condition.

Reliability

As mentioned above, inter-judge reliability was investigated using ICC. It has been suggested that ICC > 0.75 should be interpreted as excellent reliability, ICC = 0.4–0.75 as fair to poor reliability, and ICC < 0.4 as poor reliability (Fleiss 1986). Using this yardstick, all three conditions investigated had excellent reliability as regards the average measure and fair to poor reliability as regards the single measure (table 4).

Table 4. Inter-judge reliability as measured by intra-class correlation for the three conditions

Measure | C1 | C2 | C3
Single measure | 0.50 | 0.58 | 0.46
Average measure | 0.80 | 0.85 | 0.93

Listener variability

Individual differences between listeners were examined by means of visual inspection of graphical representations of data. These representations, presented in figure 1, show that, in some cases, there was a great deal of variation between listeners in their transcriptions of the same speech material from the same speaker. For instance, listener L16 (who transcribed the speech material from speakers S2 and S3) consistently produced scores which were considerably lower than those of the other three listeners who transcribed the same speech material; the difference was as large as 20–60 percentage points. Further, L6 (who transcribed S8 and S9) produced much lower scores than the other listeners who transcribed the same speech material, especially for S9; for S8, however, two listeners (L5 and L8) had high scores and two (L6 and L7) had low scores (figure 1). In one case, a speaker who yielded highly variable scores had produced a small speech sample (S3). Overall, however, listener variability did not vary consistently with the number of words or with utterance length. The listeners’ performance on the transcription task appeared to be stable in that those who differed from each other in the first condition did not differ either less or more from each other in the second or third condition. For a majority of the listeners, there was a tendency for the intelligibility score to increase slightly with each new transcription they made.

Discussion

The present study investigated the impact of the number of presentations of the utterances on the assessment of intelligibility using transcription of spontaneous speech.
The listeners first heard and transcribed an utterance once (C1), then heard it a second time and were asked to modify their transcription if appropriate (C2), and when they had listened to all utterances by one child in this manner, they listened to the utterances again, one

at a time and in the same order, and were again asked to modify their transcription if appropriate (C3). The resulting intelligibility scores increased to a statistically significant extent with each presentation (C1–C2–C3). Reliability (ICC) was excellent for all conditions in terms of average measure and poor-to-fair for all conditions in terms of single measure. The scores obtained by individual listeners were very consistent across the three conditions, but some listeners’ intelligibility scores differed markedly from those of other listeners assessing the same child. Each of these findings is discussed below.

The statistically significant differences in intelligibility scores found between the three conditions underline the importance of controlling beforehand for, and reporting, the number of presentations of stimuli in research using intelligibility measures—a variable which has not previously been subject to systematic investigation. The number of presentations can be seen as one aspect of the more general concept of familiarization, which has been found to affect intelligibility scores. It is therefore relevant to relate the findings of the present study to those made in other studies of various aspects of familiarization. It turns out that the results of the present study are consistent with previous studies investigating listener familiarization with the speaker where intelligibility scores were compared for the same speaker but different speech materials (Hustad and Cahill 2003, Borrie et al. 2012). Hustad and Cahill (2003) used four trials for each speaker and found that for speakers with severe dysarthria, intelligibility increased gradually and the difference between the first and the third trials was statistically significant. For speakers with mild dysarthria, by contrast, the increase was greatest between the first and second trials and then levelled out.
The present study found equal average increases (about 3 percentage points) between each pair of consecutive conditions. This is not surprising, since the listeners heard all utterances from one child in the same order as they had been produced and so probably learned something about the nature of the speech signal and/or the linguistic content of the stimuli that they were able to use in their assessment. This approach is in line with the clinical situation for the assessment of intelligibility in spontaneous speech. In order to understand a spoken message, two processes are necessary: the inductive bottom-up process that assembles acoustic information into phonetically and linguistically meaningful units, and the deductive top-down process that draws on the listener's linguistic–contextual knowledge at a higher level (Hustad and Beukelman 2001).

It is interesting to note that in the present study, the increase in intelligibility score was not more pronounced for the third condition, where the listener, having previously heard all of the utterances produced by each child twice, had access to the full context (or at least what she thought she understood from the context) and so should have been able to use top-down processes in support of the bottom-up process, whose reliability was reduced because of the impaired speech signal. This is in line with Hustad and Beukelman (2001), who investigated differences in intelligibility scores depending on the cues used (i.e. no cues, topic cues, alphabet cues and combined cues). They compared unrelated sentences with related sentences, finding that the overall results showed no difference between these two conditions except as regards alphabet cues. In fact, when the listeners in the present study heard an utterance for a third time, they had probably already made their supra-segmental decisions relating to the content of the utterance, because, as stated by Kent (1996: 8), 'speech perception is at times loosely derived from the acoustic signal. Knowledge-based hypotheses are pervasive influences.'

Figure 1. Listener variability and PCC score for individual speakers (S1–S12). Intelligibility scores are plotted against the y-axis and listening conditions against the x-axis. Each line represents one of the four listeners who assessed each speech material.

The findings from the present study can also be related to research on the effect of familiarity with the speaker's voice on the perception of spoken words. Nygaard et al. (1994) found that perceptual learning of the speaker's voice can facilitate the listener's perception of words. However, the differences across the three presentations in the present study were not as salient as in the study by Nygaard et al. This might be because the examination of the effect of learning in that study involved the listener assessing new words from the same speaker, whereas the present study used repetition of the same utterances. Repetition might induce listeners to stick to their first interpretation of what they perceived, making greater use of their top-down process.

The inter-judge reliability obtained in the present study is comparable to that found by other researchers, for example Zajac et al. (2010), who obtained an ICC of 0.94–0.99 using word lists, and Hodge and Gotzke (2007), who obtained 0.86 for children with typical speech development and 0.99 for children with cleft palate using spontaneous speech. Both of those studies used a single presentation of the speech material to the listener. In general, studies rarely report the reason for choosing a certain number of presentations; one exception is the study by Gordon-Brannan and Hodson (2000), where recordings were played a maximum of three times to the listener in order to approximate a first-impression judgement while not forcing listeners to rely exclusively on short-term memory. One reason for using only one presentation could be to reduce the listener burden, while using several could be justified by the expectation that this will make the listeners' responses more certain. The results of the present study show that, at least for the participants involved, the use of more than one presentation did not substantially enhance inter-judge reliability. It is thus not possible to use reliability as an argument in favour of multiple presentations, at least not when orthographic transcription is used and the aim is solely to assess the level of intelligibility rather than to make a qualitative analysis to assist the planning of an intervention.

Although this study did not evaluate intra-judge reliability, a parallel can be drawn to the consistency found between the same listener's results for the three different conditions: listeners are much more in agreement with themselves than with other listeners. Although the listeners in the present study were not involved in ratings, but only in transcriptions, this might be related to the discussion on internal standards of listeners (Kreiman et al. 1993). That is, the listeners in the present study might have different internal standards or representations of the realization of the words included in the utterances they were asked to transcribe. An additional finding was the differences in the intelligibility scores attained by listeners to the same speaker.
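The gap between the two reliability figures reported above (excellent average-measure ICC alongside poor-to-fair single-measure ICC) follows from how the two indices are defined (Shrout and Fleiss 1979). A minimal numerical sketch, using invented scores rather than the study's data, makes the point: listeners who rank the speakers identically but differ in overall strictness yield a modest single-measure ICC(2,1) and a much higher average-measure ICC(2,k). The function name and the data below are illustrative only.

```python
import numpy as np

def icc_2(scores):
    """Two-way random-effects ICCs ('Case 2' in Shrout and Fleiss 1979).

    scores: (n_targets, k_raters) array of ratings.
    Returns (single_measure, average_measure), i.e. (ICC(2,1), ICC(2,k)).
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-target (per-speaker) means
    col_means = x.mean(axis=0)   # per-rater (per-listener) means

    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-targets SS
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-raters SS
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual SS

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average

# Hypothetical data: 5 speakers, 4 listeners who agree on the ranking
# but differ systematically in strictness (offsets -15, -5, +5, +15).
ratings = [[25, 35, 45, 55],
           [35, 45, 55, 65],
           [45, 55, 65, 75],
           [55, 65, 75, 85],
           [65, 75, 85, 95]]
single, average = icc_2(ratings)
print(single, average)   # 0.60 vs about 0.86
```

With these invented scores the single-measure ICC is 0.60 while the average-measure ICC is about 0.86: the mean of several listeners is far more reliable than any one listener, which parallels the pattern of reliability coefficients reported in the present study.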
Although the overall inter-judge reliability was very high, we found surprisingly large variations when examining the individual results for some listeners and speech materials, i.e. listener variability. This was not originally part of the aim of the study, but since the differences were so striking we allow ourselves to discuss this point at some length here. These differences were evident even though the listeners were all female, of about the same age, and either enrolled in the SLP study programme or recent graduates from it. This is in line with earlier studies (McHenry 2011, Pennington and Miller 2007) suggesting that listener variability is due to some factor other than age, sex or experience. The five children (S2, S3, S8, S9 and S10) with the greatest variation in the intelligibility scores obtained by the different listeners all had a PCC score below the mean value for the overall group. This is in line with the finding by Weiss (1982) that inter-judge reliability was lower when intelligibility was reduced. Nevertheless, it is also important to bear in mind that intelligibility is influenced not only by the correct realization of a phoneme but also depends, among other things, on where in a word the incorrect realizations of targets occur (Miller 2013).

The consistency of individual listeners' performance across the three conditions suggests that the large differences in results across listeners are not due to some random factor, such as inattention or distraction, but really reflect differences in abilities, strategies or standards that are not neutralized by the administration of instructions. The ability to adapt to degraded speech and impaired intelligibility at different signal-to-noise ratios has been investigated in relation to non-speech auditory, cognitive and neuro-anatomical factors (such as amplitude-modulation (AM) rate discrimination, working-memory capacity and brain-activation patterns of the listener) (Ljung et al. 2012, Erb et al. 2012). The results on working memory are somewhat contradictory, with some research indicating that a larger working-memory buffer increases the ability to adapt to degraded speech while other research fails to show such a connection (cf. Erb et al. 2012). The present study used utterances of spontaneous speech up to 15 words long, which makes the task comparable to that of recalling speech, where working memory can be an important factor. The above-mentioned studies, together with the present one, underscore the need for additional research on the connection between listeners' abilities and their performance on different intelligibility tasks.

Studies of listener strategies are rare, but listener barriers (factors that listeners perceive as making speech more difficult to understand) and strategies have been studied in relation to ALS (amyotrophic lateral sclerosis) and HD (Huntington's disease) (Klasner and Yorkston 2005). The specific strategy items that were strongly endorsed were segmental for ALS (e.g. 'I tried to put sounds together to make words') but supra-segmental for HD ('I depended on breaks between words to help me understand the sentence') (Klasner and Yorkston 2005). Hustad et al. (2011) used the listening-strategy scale developed by Klasner and Yorkston (2005) to investigate whether listening strategies differed between listeners who obtained low and high intelligibility scores, respectively, for the same speech material. They found no difference in the variables they examined (i.e. segmental, supra-segmental, linguistic and cognitive; an example of the latter category being 'I had to concentrate on understanding the sentence'). The authors suggest the development of a scale that specifically focuses on understanding what the strongest and weakest listeners do when trying to understand disordered speech. In the present study, the fact that guessing was discouraged might have affected the strategy that listeners used and caused them to rely less on context and adopt a more word-by-word approach. The use of different strategies by listeners clearly remains an important question for future research.

Some studies have excluded listeners who are considered to be outliers (Pennington and Miller 2007, Weismer and Laures 2002, Garcia and Dagenais 1998). This might be a reasonable way to handle large listener variability (such as that found in the present study), although it should be kept in mind that any such inclusion/exclusion cut-off will be rather arbitrary, given our current state of knowledge. Besides using the average score for several listeners, as is often done (Hodge and Gotzke 2007, Yorkston and Beukelman 1981), researchers might consider excluding listeners who are outliers as regards the intelligibility scoring. This would yield results that are more consistent with a speaker's average intelligibility performance, since listeners with very aberrant transcriptions (like L6 and L16 in the present study) may give rise to average scores that are not representative of the speech material. Whether such a procedure is appropriate probably depends on the purpose of the assessment; in any case, the investigation of the factors that affect listener comprehension and thus cause such large differences as were found here remains an important topic for future research.

One limitation of the present study as regards the listener task is that the order of the utterances was not counterbalanced but was always the same. This might be considered a potential threat to the reliability of the results.

Finally, the present study has tentative clinical implications. If a test is to be used to monitor change, it is of critical importance that it be stable. During speech therapy in a clinical setting, the test method has to be sensitive to the effect of intervention and to symptom progression but not to unrelated factors such as the number of presentations, familiarization or fatigue (Vogel and Maruff 2014).
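Returning briefly to the listener-exclusion procedure discussed above: one way such a rule could be sketched is to score a speaker by the mean of only those listeners whose intelligibility score falls within a fixed distance of the group median. The scores and the 15-point cut-off below are invented for illustration, in line with the caveat that any cut-off is somewhat arbitrary.

```python
from statistics import mean, median

def robust_speaker_score(listener_scores, cutoff=15.0):
    """Mean intelligibility for one speaker after dropping aberrant listeners.

    listener_scores: per-listener intelligibility percentages (hypothetical).
    cutoff: maximum allowed distance, in percentage points, from the group
        median; the value 15 is an arbitrary illustration, not a recommendation.
    """
    med = median(listener_scores)
    kept = [s for s in listener_scores if abs(s - med) <= cutoff]
    return mean(kept)

# Four hypothetical listeners; the fourth is far from the rest
# (compare the very aberrant listeners L6 and L16 in the study).
scores = [62.0, 65.0, 60.0, 30.0]
print(mean(scores))                  # plain mean: 54.25, pulled down
print(robust_speaker_score(scores))  # mean of the three consistent listeners
```

A median-based rule is used here rather than a standard-deviation (z-score) rule because, with only four listeners per speaker, a single outlier can never exceed a z of about 1.5 (the maximum possible with n = 4) and so would never be flagged.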
The present study has shown that it is important to use the same number of presentations and the same context when intelligibility is assessed using transcriptions of spontaneous speech, if the intention is to compare results. As regards implications for research, the present study shows that it is crucial to report the number of presentations used in any study or description of intelligibility. A further possible conclusion is that a single presentation of the utterances may yield sufficient reliability and ecological validity, while also making the listener's burden more reasonable, since higher reliability was not obtained with repeated presentations of the utterances to the listener. This would make it feasible to have larger listener groups assess the same speaker, which may be necessary given that listener variability has proved to be large, or to include a larger number of speakers in studies. Further, the use of a single presentation would make transcription-based assessment using spontaneous speech less time-consuming and therefore a more feasible tool in daily clinical work, to be used instead of, or as a complement to, rating scales or other methods.

Acknowledgements

The authors wish to thank the SLPs Anna-Karin Ahlman, MSc, and Andrea Börjesson, MSc, for valuable help with the collection of data; the research engineer Jonas Lindh for help with the software; and all the SLP students for valuable help with the transcription of the speech samples.

Declaration of interest: This study was funded in part by the Swedish Research Council and the Petter Silfverskiöld Memorial Fund. Responsibility for the content of this article lies with the authors. The authors report no declarations of interest.

References

BOERSMA, P., 2001, Praat, a system for doing phonetics by computer. Glot International, 5, 341–345.
BORRIE, S. A., MCAULIFFE, M. J., LISS, J. M., KIRK, C., O'BEIRNE, G. A. and ANDERSON, T., 2012, Familiarisation conditions and the mechanisms that underlie improved recognition of dysarthric speech. Language and Cognitive Processes, 27, 1039–1055.
ERB, J., HENRY, M. J., EISNER, F. and OBLESER, J., 2012, Auditory skills and brain morphology predict individual differences in adaptation to degraded speech. Neuropsychologia, 50, 2154–2164.
FLEISS, J. L., 1986, The Design and Analysis of Clinical Experiments (New York, NY: Wiley).
FLIPSEN, P., 2006, Measuring the intelligibility of conversational speech in children. Clinical Linguistics and Phonetics, 20, 303–312.
GARCIA, J. M. and DAGENAIS, P. A., 1998, Dysarthric sentence intelligibility: contribution of iconic gestures and message predictiveness. Journal of Speech, Language and Hearing Research, 41, 1282–1293.
GORDON-BRANNAN, M. and HODSON, B. W., 2000, Intelligibility/severity measurements of prekindergarten children's speech. American Journal of Speech–Language Pathology, 9, 141–150.
HALEY, K. L., ROTH, H., GRINDSTAFF, E. and JACKS, A., 2011, Computer-mediated assessment of intelligibility in aphasia and apraxia of speech. Aphasiology, 25, 1600–1620.
HODGE, M. and GOTZKE, C. L., 2007, Preliminary results of an intelligibility measure for English-speaking children with cleft palate. Cleft Palate Craniofacial Journal, 44, 163–174.
HODGE, M. M. and GOTZKE, C. L., 2011, Minimal pair distinctions and intelligibility in preschool children with and without speech sound disorders. Clinical Linguistics and Phonetics, 25, 853–863.
HUSTAD, K. C., 2006, Estimating the intelligibility of speakers with dysarthria. Folia Phoniatrica et Logopaedica, 58, 217–228.
HUSTAD, K. C. and BEUKELMAN, D. R., 2001, Effects of linguistic cues and stimulus cohesion on intelligibility of severely dysarthric speech. Journal of Speech, Language and Hearing Research, 44, 497–510.
HUSTAD, K. C. and CAHILL, M. A., 2003, Effects of presentation mode and repeated familiarization on intelligibility of dysarthric speech. American Journal of Speech–Language Pathology, 12, 198–208.
HUSTAD, K. C., DARDIS, C. M. and KRAMPER, A. J., 2011, Use of listening strategies for the speech of individuals with dysarthria and cerebral palsy. Augmentative and Alternative Communication, 27, 5–15.
HUSTAD, K. C., SCHUELER, B., SCHULTZ, L. and DUHADWAY, C., 2012, Intelligibility of 4-year-old children with and without cerebral palsy. Journal of Speech, Language and Hearing Research, 55, 1177–1189.
JOHANNISSON, T. B., LOHMANDER, A. and PERSSON, C., 2014, Assessing intelligibility by single words, sentences and spontaneous speech: a methodological study of the speech production of 10-year-olds. Logopedics, Phoniatrics and Vocology, 39, 159–168.
KENT, R. D. (ed.), 1992, Intelligibility in Speech Disorders: Theory, Measurement and Management (Amsterdam: John Benjamins).
KENT, R. D., 1996, Hearing and believing: some limits to the auditory–perceptual assessment of speech and voice disorders. American Journal of Speech–Language Pathology, 5, 7–23.
KENT, R. D., MIOLO, G. and BLOEDEL, S., 1994, The intelligibility of children's speech: a review of evaluation procedures. American Journal of Speech–Language Pathology, 3, 81–95.
KEUNING, K. H., WIENEKE, G. H. and DEJONCKERE, P. H., 1999, The intrajudge reliability of the perceptual rating of cleft palate speech before and after pharyngeal flap surgery: the effect of judges and speech samples. Cleft Palate Craniofacial Journal, 36, 328–333.
KLASNER, E. R. and YORKSTON, K. M., 2005, Speech intelligibility in ALS and HD dysarthria: the everyday listener's perspective. Journal of Medical Speech–Language Pathology, 13, 127–139.
KREIMAN, J., GERRATT, B. R., KEMPSTER, G. B., ERMAN, A. and BERKE, G. S., 1993, Perceptual evaluation of voice quality: review, tutorial, and a framework for future research. Journal of Speech and Hearing Research, 36, 21–40.
KWIATKOWSKI, J. and SHRIBERG, L. D., 1992, Intelligibility assessment in developmental phonological disorders: accuracy of caregiver gloss. Journal of Speech and Hearing Research, 35, 1095–1104.
LAGERBERG, T. B., ÅSBERG, J., HARTELIUS, L. and PERSSON, C., 2014, Assessment of intelligibility using children's spontaneous speech: methodological aspects. International Journal of Language and Communication Disorders, 49, 228–239.
LILLVIK, M., ALLEMARK, E., KARLSTRÖM, P. and HARTELIUS, L., 1999, Intelligibility of dysarthric speech in words and sentences: development of a computerised assessment procedure in Swedish. Logopedics, Phoniatrics and Vocology, 24, 107–119.
LINDBLOM, B., 1990, On the communication process: speaker–listener interaction and the development of speech. Augmentative and Alternative Communication, 6, 220–230.
LJUNG, R., ISRAELSSON, K. and HYGGE, S., 2012, Speech intelligibility and recall of spoken material heard at different signal-to-noise ratios and the role played by working memory capacity. Applied Cognitive Psychology, 27, 198–203.
LOHMANDER, A., BORELL, E., HENNINGSSON, G., HAVSTAM, C., LUNDEBORG, I. and PERSSON, C., 2005, SVANTE: svenskt artikulations- och nasalitets-test (Skivarp: Pedagogisk design).
MCAULIFFE, M. J., CARPENTER, S. and MORAN, C., 2010, Speech intelligibility and perceptions of communication effectiveness by speakers with dysarthria following traumatic brain injury and their communication partners. Brain Injury, 24, 1408–1415.
MCHENRY, M., 2011, An exploration of listener variability in intelligibility judgments. American Journal of Speech–Language Pathology, 20, 119–123.
MCLEOD, S., HARRISON, L. J. and MCCORMACK, J., 2012, Intelligibility in Context Scale: validity and reliability of a subjective rating measure. Journal of Speech, Language and Hearing Research, 55, 648–656.
MORRIS, S. R. and WILCOX, K. A., 1999, The Children's Speech Intelligibility Measure (San Antonio, TX: Psychological Corp.).
MORRIS, S. R., WILCOX, K. A. and SCHOOLING, T. L., 1995, The Preschool Speech Intelligibility Measure. American Journal of Speech–Language Pathology, 4, 22–28.
NYGAARD, L. C., SOMMERS, M. S. and PISONI, D. B., 1994, Speech perception as a talker-contingent process. Psychological Science, 5, 42–46.
PENNINGTON, L. and MILLER, N., 2007, Influence of listening conditions and listener characteristics on intelligibility of dysarthric speech. Clinical Linguistics and Phonetics, 21, 393–403.
SCHIAVETTI, N., 1992, Scaling procedures for the measurement of speech intelligibility. In R. D. KENT (ed.), Intelligibility in Speech Disorders (Amsterdam: John Benjamins), 11–34.
SHRIBERG, L. D. and KWIATKOWSKI, J., 1982, Phonological disorders III: a procedure for assessing severity of involvement. Journal of Speech and Hearing Disorders, 47, 256–270.
SHROUT, P. E. and FLEISS, J. L., 1979, Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86, 420–428.
SUSSMAN, J. E. and TJADEN, K., 2012, Perceptual measures of speech from individuals with Parkinson's disease and multiple sclerosis: intelligibility and beyond. Journal of Speech, Language and Hearing Research, 55, 1208–1219.
VOGEL, A. P. and MARUFF, P., 2014, Monitoring change requires a rethink of assessment practices in voice and speech. Logopedics, Phoniatrics and Vocology, 39, 56–61.
WEISMER, G. and LAURES, J. S., 2002, Direct magnitude estimates of speech intelligibility in dysarthria: effects of a chosen standard. Journal of Speech, Language and Hearing Research, 45, 421–433.
WEISS, C. E., 1982, Weiss Intelligibility Test (Tigard, OR: CC Publ.).
WHITEHILL, T. L., 2002, Assessing intelligibility in speakers with cleft palate: a critical review of the literature. Cleft Palate Craniofacial Journal, 39, 50–58.
WHITEHILL, T. L. and CHAU, C. H., 2004, Single-word intelligibility in speakers with repaired cleft palate. Clinical Linguistics and Phonetics, 18, 341–355.
WILSON, E. O. and SPAULDING, T. J., 2010, Effects of noise and speech intelligibility on listener comprehension and processing time of Korean-accented English. Journal of Speech, Language and Hearing Research, 53, 1543–1554.
YORKSTON, K. M. and BEUKELMAN, D. R., 1981, Assessment of Intelligibility of Dysarthric Speech (Tigard, OR: CC Publ.).
ZAJAC, D., PLANTE, C., LLOYD, A. and HALEY, K., 2010, Reliability and validity of a computer-mediated single-word intelligibility test: preliminary findings for children with repaired cleft lip and palate. Cleft Palate Craniofacial Journal, 48, 538–548.
