Br. J. med. Psychol. (1977), 50, 367-373

Printed in Great Britain


Voice analysis for the measurement of anxiety

G. A. Smith

A recently developed technique for the acoustical analysis of speech is described. Speech is analysed electronically for the presence or absence of a microtremor having a frequency of about 10 Hz. This tremor is said to be attenuated in states of psychological stress. The paper presents data supporting the validity of this as a measure of anxiety, using states of both normal and pathological anxiety. An objective scoring system is proposed to overcome some of the problems of unreliability. A number of practical advantages of the voice technique are described.

In searching for acoustical indicants of stress, there are a number of elements of vocalization which can be selected. Two of the basic elements are the intensity of the voice and its frequency composition. The relationship of these to psychological stress or anxiety has been investigated by a number of workers, including Alpert, Kurtzburg & Friedhoff (1963), Hecker, Stevens, von Bismark & Williams (1968), Rubenstein (1966), and Williams & Stevens (1969). These investigations have been technically difficult and their results complex.

More recently, there has been interest in another type of acoustic indicator. This involves the existence of irregularities or rhythmic modulations of the acoustic signal. Although common sense might indicate a search for phenomena related to the type of tremor visible in the body of a person in a state of anxiety and audible in his voice, interest has in fact centred on the discovery of a microtremor which exists in states of relaxation. A hypothesis has developed which relates this microtremor of the voice to the phenomenon of physiological tremor, the existence of which has been known since the end of the 19th century, and which has been investigated more recently by Lippold (1967). He noted that the normal contraction of voluntary muscle is accompanied by a slight oscillation at about 10 Hz, and that this physiological tremor may be attenuated in states of arousal. Voice analysing equipment has been produced by Dektor (1971), with the claim that it displays physiological tremor of the voice mechanism, and that this tremor is seen to be attenuated by psychological stress. Papçun (1974) noted that this claim depends upon a whole series of propositions, none of which has been clearly validated in published research.

The present author is not equipped to provide any evidence for the physiological elements of the theory. However, there is no doubt that a tremor of some sort, with a frequency range of 8-14 Hz, does manifest itself in the acoustic signal of speech. The observation of this slow modulation superimposed upon the audible voice frequencies is an easy one, using the equipment described below. This paper is concerned only with the empirical issues of the reliability and validity of this form of stress measurement. Studies of these issues have taken several forms:

(1) The voices of persons under known stress have been compared with the voices of persons under less stress, for example Brenner (1974), Smith (1974) and Worth & Lewis (1975).

(2) The voices of persons known to be lying have been investigated for short-term changes at the moment of lying, with the assumption that lying would produce stress, for example Barland (1974). In view of the movement towards the application of voice analysis in lie-detection, the ethical issues have been discussed by Smith (1975).

(3) Voice changes have been compared with stress responses indicated by more established techniques such as the polygraph. This has been done mainly in lie-detection situations, for example Barland (1974).

In general, there is a promising amount of data pointing towards some degree of validity. No investigations of emotional states other than anxiety have been made yet. The aim of the present paper is to replicate some of the earlier work, and to investigate some refinements of the general technique.

Method

The equipment

A high quality four-speed reel-to-reel tape-recorder, the Uher 2000, is used to gather the voice sounds. Alternatively, tapes made with other recorders may be copied on to the Uher. Such copying, as well as transmission over telephone or radio channels, involves some loss of signal quality, but in general this is not crucial. The Uher tape is later played through an audio lead to the analysing equipment, taking advantage of the four possible speeds of playback. Normally, the tape is processed at 1/4 or 1/8 of recording speed.

The analysing equipment is a Dektor Psychological Stress Evaluator, PSE-1. A later model, the PSE-101, is similar. Its output is displayed as pulse-shaped oscillations drawn on a moving paper chart by a heat pen. The chart speed is normally 20 mm/sec, so that with the tape played at 1/4 of recording speed the time scale is 80 mm per second of recorded speech.

The general function of the PSE is to convert complex waveforms into much simpler waveforms, by selecting a dominant frequency. As a simple example, an input of two pure tones, one at 200 Hz and the other at 250 Hz, is converted into a simple 50 Hz output, this being the frequency of the highest peaks in the resultant of the original two tones. With a voice as input, the PSE output is the dominant or fundamental frequency of the voice. The graph indicates the amplitude of each sound oscillation which occurs at this frequency. The average male voice is displayed as pulses at about 80-160 Hz, and the average female voice at about 160-320 Hz. Voices transmitted through poor quality channels, such as a bad telephone link, may show higher frequencies, presumably because of the channel's poor response at the lower frequencies.

There are four Mode buttons. Modes I, II and III control the degree of smoothing of the oscillations, with Mode I giving the greatest smoothing. Mode IV is insensitive to square wave input, as opposed to sine wave, and is useful if noise filtering is needed. In general, Mode III is the most used, as in this paper.
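The two-tone example and the chart time scale can be checked with a little arithmetic. The sketch below is only an illustration of those two statements, not a model of the PSE circuitry, which the paper does not describe. The function names, the sampling rate and the choice of cosine phase are all assumptions made for the example.

```python
import math


def two_tone(t, f1=200.0, f2=250.0):
    """Sum of two equal-amplitude pure tones (cosine phase is an assumption)."""
    return math.cos(2 * math.pi * f1 * t) + math.cos(2 * math.pi * f2 * t)


def highest_peak_times(f1=200.0, f2=250.0, rate=20000, duration=0.1, floor=1.99):
    """Times of interior local maxima that come close to the largest value, 2."""
    n = round(rate * duration)
    v = [two_tone(i / rate, f1, f2) for i in range(n + 1)]
    return [i / rate for i in range(1, n)
            if v[i] > floor and v[i] > v[i - 1] and v[i] > v[i + 1]]


def chart_time_scale(chart_speed_mm_per_s, playback_fraction):
    """Millimetres of chart per second of recorded speech."""
    return chart_speed_mm_per_s / playback_fraction


peaks = highest_peak_times()
spacing = peaks[1] - peaks[0]          # seconds between successive highest peaks
print(round(1 / spacing))              # 50 (Hz), the difference of the two tones
print(chart_time_scale(20, 1 / 4))     # 80.0 (mm per second of speech)
```

The highest peaks of the summed tones recur every 0.02 sec, which is the 50 Hz output quoted in the text, and a chart running at 20 mm/sec with the tape slowed to 1/4 speed gives the stated 80 mm/sec.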

The chart rating systems

With ordinary speech as input, the PSE output consists of a series of responses. A PSE response is defined here as a rise of the pen from its zero base-line, followed by a number of pulses of varying amplitude at the fundamental frequency, and completed by a fall of the pen to its base-line. A response typically lasts for about 0.1-2 sec, and corresponds to a spoken word, although it can correspond to one syllable or to more than one word, depending on the extent to which the speech is disjointed or continuous.

Two scoring systems were compared, having different practical and theoretical implications:

(1) Rating by visual inspection. This is illustrated in Fig. 1, and it is the original Dektor (1971) system condensed to a three-point scale. It is based upon the 10 Hz tremor principle, and it is useful to have a 10 Hz waveform drawn upon a piece of transparent plastic as an aid. With practice, this method is a rapid one.


Figure 1. Visual rating criteria. (a) Scored 0 (no stress). Modulation good, with distance between peaks (indicated) about 0.1 sec. (b) Scored 1 (some stress). Some attenuation of the modulation. (c) Scored 2 (good stress). No modulation.


Figure 2. Objective scoring. Measure the height of the response, and find its midpoint (indicated). Measure the time (X) for which the response is above this midpoint, as well as the total time of the response (Y). The score is X/Y, expressed as a percentage.

(2) Scoring by objective measurement. This is illustrated in Fig. 2, and it is of the author’s own devising. It is only one of a number of possible methods; comparison with further scoring systems is beyond the scope of this paper. Some thought will show that, although it is based upon the idea of tremor, it is not related to any particular frequency of tremor. As the measurements have to be done by hand, the method is rather more tedious than visual inspection.
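The objective score of Fig. 2 can be sketched in a few lines, assuming the pen trace of one response has been digitised at equal time steps. That digitisation is an assumption of this example, since the measurements in the paper were made by hand on the chart. The function name and the sample values are hypothetical.

```python
def objective_score(heights):
    """Percentage of a response's duration spent above half its peak height.

    `heights` holds pen heights above the zero base-line, sampled at equal
    time steps from the rise of the pen to its fall (assumed digitisation).
    """
    if not heights:
        raise ValueError("a response needs at least one sample")
    midpoint = max(heights) / 2.0                   # midpoint of the response height
    x = sum(1 for h in heights if h > midpoint)     # samples above the midpoint (X)
    y = len(heights)                                # all samples in the response (Y)
    return 100.0 * x / y


modulated = [1, 5, 1, 5, 1, 5, 1, 5]   # hypothetical strongly modulated response
blocked = [5, 5, 5, 5, 5, 5, 5, 5]     # hypothetical flat-topped response
print(objective_score(modulated))      # 50.0
print(objective_score(blocked))        # 100.0
```

A trace that dips repeatedly below the midpoint scores low, while a flat-topped block scores close to 100. The score depends only on how much of the response lies above the midpoint, not on the rate of the dips, which is the sense in which it is not tied to any particular tremor frequency.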

Experiment I: Stress in broadcasting

The broadcasting situation was selected as an example of a conveniently accessible source of speech data in which hypotheses may be made about where stress ought to be found. The main hypothesis was that professional broadcasters would be under a lesser degree of stress, because of their experience of the situation, than members of the public, who would be subject to the anxiety of a new experience. Radio ‘phone-in’ programmes are a convenient source of subjects from the general public, and the hypothesis that they would be more anxious would be reinforced by the fact that they are calling about personal problems.

Subjects

Public broadcasts on FM radio by the BBC and IBA stations were monitored for one afternoon and evening, and recorded by a domestic radio cassette recorder. The cassettes were copied on the Uher recorder. Samples of speech were obtained from 42 people during this period. The aim was to take at random from each subject’s speech a sample section containing ten consecutive PSE responses. Previous studies by Smith (1974) had suggested that about ten responses would give an optimum degree of reliability and validity, in the context of the measurement of small differences in anxiety. Seven subjects had to be eliminated, either because they were unable to provide ten consecutive responses, or because the noise level was extremely high. The remaining 35 subjects fell into three groups:

(a) 12 professional broadcasters, mainly journalists, talking in the studio. There were 8 men and 4 women.

(b) 10 professional broadcasters, all journalists, telephoning news reports into the studio. There were 9 men and 1 woman.

(c) 13 members of the public telephoning an IBA station. Most were calling a ‘problems’ programme, although none seemed to have any particularly severe problems. Most were ‘requesting prayers’ for friends and relatives who were ill. There were 4 men and 9 women.

The reason for having the two groups of professionals was to control for any possible effects of the telephone channel of transmission, both as an electronic variable and as a stress inducer. The groups were clearly different in sex composition, but previous studies had given no reason to suppose that this would be relevant to the hypothesis being tested.

Stress rating

The 35 samples of speech, one from each subject, were processed by the PSE. A tape speed of 1/4 was used for males, and 1/8 for females because of their higher fundamental voice frequency. The individual charts of the 350 responses were coded, and then separated and mixed up so that a blind rating could be done. As might be expected, the higher levels of stress were not seen, so that only the ratings 0 and 1 were used in the visual inspection method. Each subject’s stress score was defined as the sum of his separate scores on each of his ten responses. The maximum possible range of scores therefore could be from 0 to 20. With the objective scoring method, each subject’s stress score was defined as the average of his ten response scores. The range would have an upper limit of 100, and a lower limit of somewhere between 0 and 50.

The results are given later.
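As a minimal sketch of the two subject-level scores defined under Stress rating: the visual score is a sum over ten responses rated 0, 1 or 2, and the objective score is a mean of ten percentages. The function names and the sample ratings are hypothetical and are not data from the study.

```python
def visual_stress_score(ratings):
    """Sum of ten visual ratings, each 0, 1 or 2, giving a possible range of 0-20."""
    if len(ratings) != 10 or any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("expected ten ratings, each 0, 1 or 2")
    return sum(ratings)


def objective_stress_score(response_scores):
    """Mean of a subject's objective response scores (percentages)."""
    if not response_scores:
        raise ValueError("expected at least one response score")
    return sum(response_scores) / len(response_scores)


# Hypothetical subject, for illustration only.
print(visual_stress_score([0, 1, 0, 0, 1, 0, 0, 0, 1, 0]))   # 3
print(objective_stress_score([50.0, 60.0] * 5))              # 55.0
```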

Experiment II: Phobic anxiety

In previous work by Smith (1974) it had been observed that some neurotic patients showed a high degree of PSE stress, whereas others showed very little. A similar observation was made by Lader & Mathews (1968), who used a skin conductance measure to demonstrate a higher degree of arousal in anxiety states, agoraphobics and social phobics than in specific phobics and normals. The aim of the present experiment was to replicate this finding in terms of the PSE.

Subjects

These were 18 neurotic out-patients on the therapeutic case list of the Psychology Department at the time of the study. There were 10 men and 8 women, and most had been referred very recently. In terms of psychiatric diagnosis there were agoraphobics, social phobics, obsessionals, hypochondriacs, and mild depressives, but no specific phobics. They were allocated to two groups:

(1) ‘Phobics’ were defined as those in whom the clinical and experimental situation itself would be considered particularly anxiety provoking. These were patients with agoraphobia (who had come a long way from home to see the therapist), and patients with social phobia (who now faced this novel scrutiny). There were eight patients in this group.

(2) ‘Non-phobics’ were defined as those for whom the clinical and experimental situation bore no particular resemblance to their anxieties. The obsessionals, hypochondriacs, and mild depressives were put in this group. There were ten of them.

Fifteen normals were used for comparison. They comprised student nurses and other professionals.

Procedure

The subject sat in an easy chair, with the Uher microphone fairly close to his mouth. He was asked to count aloud from 1 to 10 at a comfortable speed. The recordings were done in normal office conditions, with a certain amount of noise in the background. It was explained to the subject that the idea was to measure his anxiety level from his voice. The recordings were processed by the PSE, and each subject’s ten utterances were rated visually on the 20-point scale used in the previous experiment. The ratings were not done ‘blind’, as before, however. Objective scores were also taken.

Results

Reliability

A split-half procedure was used, and an assumption of a normal distribution was appropriate. Over all 68 subjects, the following Pearson correlations were obtained:

(1) Visual ratings. Split-half reliability was given by r = 0.24, which is significant with P = 0.03. By the Brown-Spearman formula, this corrects to r = 0.39 as the reliability of the full ten items.

(2) Objective scores. The corresponding figures were r = 0.44 (P = 0.0002), correcting to r = 0.61.

Hence the objective scoring system seemed to be of superior reliability, although the relatively low range of scores may have led to some unfairness in this comparison.
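The correction from split-half to full-length reliability can be reproduced with the standard step-up formula, usually written as the Spearman-Brown formula, r_full = k r / (1 + (k - 1) r), with k = 2 for two halves. The sketch below assumes this is the formula the text refers to. The function name is hypothetical.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Step-up reliability for a test lengthened by `length_factor`."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


print(round(spearman_brown(0.24), 2))   # 0.39 (visual ratings)
print(round(spearman_brown(0.44), 2))   # 0.61 (objective scores)
```

Both corrected values agree with the figures reported for the full ten items.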

Correlation between visual ratings and objective scores

An assumption of a normal distribution was appropriate. The correlation between visual and objective scores over the 68 subjects was given by Pearson r = +0.61 (highly significant, P = 0.00001).

Experiment I

(1) Visual ratings. The mean score for the professional broadcasters in the studio was 1.33 (s = 0.99); for the professionals on the telephone it was 1.20 (s = 0.92). These means were not significantly different. For the members of the public the mean score was 3.54 (s = 1.76). By Mann-Whitney U test (because of the different variances), the members of the public were significantly more stressed than the professionals (P=0401).

(2) Objective scores. The corresponding mean scores were 46.9 (s = 6.57) and 50.0 (s = 6.50) for the professionals, with 61.0 (s = 7.35) for the members of the public. By t test, the difference between the professionals and the members of the public was again significant (P=0401).

Experiment II

(1) Visual ratings. The mean stress score of the normals was 2.13 (s = 1.36). For the non-phobic patients it was 3.10 (s = 1.29), and for the phobics it was 5.37 (s = 0.52). By Mann-Whitney U test, the phobics were significantly more stressed than the non-phobics (P=04002). By t test, the non-phobics were only marginally more stressed than the normals (P = 0.05).

(2) Objective scores. The mean stress score for the normals was 57.6 (s = 8.10). For the non-phobic patients it was 59.2 (s = 3.36), and for the phobics it was 64.2 (s = 4.08). By t test, the phobics were significantly more stressed than the non-phobics (P = 0.01). By Mann-Whitney U test, the non-phobics were not significantly more stressed than the normals.

Table 1 shows the distribution of low and high scores in the main groups of Expts I and II, for the objective scoring system. Defining a visual rating of 4 or more as a high score gives an almost identical distribution for the visual rating system.

Table 1. Experiments I and II. Numbers in each group scoring low stress (objective score less than 60) and high stress (more than 60)

                                  Professional  Phone-in  Normals and
                                  broadcasters  callers   non-phobics  Phobics
Low stress                        22            7         15           1
High stress                       0             6         10           7
Percentage of group scoring high  0             46        40           87
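The percentages in Table 1 can be recomputed from the group counts, as in the sketch below. The names are hypothetical. The low-stress count of 7 for phone-in callers is an inference from the group size of 13 less the 6 high scorers, and the truncation to whole numbers is an assumption chosen because it reproduces the printed 87 for the phobics (7 of 8).

```python
# (low-stress count, high-stress count) per group, transcribed from Table 1.
# The low count for phone-in callers is inferred: 13 callers minus 6 high scorers.
TABLE_1 = {
    "Professional broadcasters": (22, 0),
    "Phone-in callers": (7, 6),
    "Normals and non-phobics": (15, 10),
    "Phobics": (1, 7),
}


def stress_band(objective_score, cutoff=60.0):
    """Classify an objective score; exactly 60 is not covered by the table caption."""
    if objective_score < cutoff:
        return "low"
    if objective_score > cutoff:
        return "high"
    return "unclassified"


def percent_high(low, high):
    """Whole-number percentage scoring high, truncated (assumed convention)."""
    return (100 * high) // (low + high)


for group, (low, high) in TABLE_1.items():
    print(group, percent_high(low, high))
```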


Discussion

The data support the PSE as a measure of anxiety. ‘Stress blocking’ of the voice patterns generally appeared where it should appear, i.e. in situations of both normal and pathological anxiety, and in the right proportions. The results of these experiments were particularly clear, supporting the various modifications developed by the author during three years of experimentation with the PSE.

What was absent from the early PSE techniques was a sufficient recognition of the problem of reliability, with its need for a sufficient quantity of voice data from the subject. Attempts to devise more precise and objective scoring systems are highly desirable from the point of view of rater reliability. However, subject unreliability is equally important. For example, Smith (1974) observed that hyperventilation had an effect on the PSE patterns unconnected with psychological stress, producing ‘false’ stress blocking. Of course, no universal figure for the degree of required reliability could ever be set. Stress is naturally a fluctuating condition, and the degree of reliability needed for its measurement has to be judged according to the circumstances.

The objective scoring system proposed in this paper seems to offer a useful alternative to the visual inspection method. The visual method is the one currently used ‘in the field’, and despite the inferior reliability shown by it in this paper, it should not be denigrated lightly. The range of stress levels observed in this paper was fairly small, and when a greater range is involved the visual method may be perfectly adequate, as well as much more convenient. However, the success of the objective scoring system does raise theoretical problems which cannot be settled by the present data. As mentioned above, it is based upon the idea of tremor or amplitude variation, but it is not fundamentally related to any particular frequency of variation. This opens a door to the investigation of the supposed relationship of the PSE to physiological tremor.

Attention should be drawn to the practical virtues of the PSE as a psychophysiological technique. It is particularly non-invasive in character, not even requiring electrodes to be attached to the subject. The subject and the PSE can be separated in space and time to any desired extent. Some of these characteristics have led to fears of surveillance without the knowledge or consent of the subject, for example over the telephone. Ethical standards clearly must be considered (Smith, 1975).

There are several possible directions for further research: (1) the addition of the PSE to other psychophysiological devices in research on emotional states, or its use in situations where no other device would be applicable; (2) the development of specific assessment techniques such as lie-detection or the identification of fears; (3) the study of verbal interactions and the communication of emotions through speech.

Acknowledgement

The financial support of Powick Hospital and the West Midlands Regional Health Authority is acknowledged.

References

ALPERT, M., KURTZBURG, R. L. & FRIEDHOFF, A. J. (1963). Transient voice changes associated with emotional stimuli. Archs gen. Psychiat. 8, 362-365.
BARLAND, G. H. (1974). Use of voice changes in the detection of deception. (Abstract.) J. acoust. Soc. Am. 55, 423.
BRENNER, M. (1974). Stagefright and Stevens’ Law. (Unpublished. Available from Dektor.)
DEKTOR (1971). The psychological stress evaluator. Dektor C.I.S. Inc., 5508 Port Royal Road, Springfield, Va. 22151, USA.
HECKER, M. H. L., STEVENS, K. N., von BISMARK, G. & WILLIAMS, C. E. (1968). Manifestations of task-induced stress in the acoustic speech signal. J. acoust. Soc. Am. 44, 993-1001.
LADER, M. H. & MATHEWS, A. M. (1968). A physiological model of phobic anxiety and desensitization. Behav. Res. Ther. 6, 411-421.
LIPPOLD, O. C. J. (1967). Electromyography. In P. H. Venables & I. Martin (eds), A Manual of Psychophysiological Methods. Amsterdam: North Holland.
PAPÇUN, G. (1974). The effects of psychological stress on speech: Literature survey and background. (Abstract.) J. acoust. Soc. Am. 55, 422-423.
RUBENSTEIN, L. (1966). Electro-acoustical measurement of vocal responses to limited stress. Behav. Res. Ther. 4, 135-138.
SMITH, G. A. (1974). The measurement of anxiety: a new method by voice analysis. I.R.C.S., Med. Sci. 2, 1707.
SMITH, G. A. (1975). Secret lie detector in the lab. New Scientist 67, 476-478.
WILLIAMS, C. E. & STEVENS, K. N. (1969). On determining the emotional state of pilots during flight: An exploratory study. Aerospace Med. 40, 1369-1372.
WORTH, J. W. & LEWIS, B. (1975). Presence of the dentist: A stress evoking cue? Virginia Dental J. (In Press.)

Received 13 April 1976; revised version received 20 September 1976

Requests for reprints should be addressed to G. A. Smith, Senior Clinical Psychologist, Powick Hospital, Powick, Worcester WR2 4SH.
