Communications

Image processing system for interpreting motion in American Sign Language

C. Charayaphan and A. E. Marble

Electrical Engineering Department, Technical University of Nova Scotia, Halifax, N.S., Canada

Received October 1991, accepted February 1992

ABSTRACT

In this paper, an image processing algorithm is presented for the interpretation of American Sign Language (ASL), which is one of the sign languages used by the majority of the deaf community. The process involves detection of hand motion, tracking of the hand location based on the motion, and classification of signs using adaptive clustering of stop positions, the simple shape of the trajectory, and matching of the hand shape at the stop position.

Keywords: American sign language, image processing, hand motion, adaptive clustering

INTRODUCTION

Hearing impairment is the most common communication disorder, affecting about 8% of the world's population; there are about 200,000 hearing impaired in Canada [1]. Mild or moderate losses comprise the majority of hearing impairments. Conventional sound amplification with electronic hearing aids is the standard method of rehabilitation for this impairment, but it can be of little or no help for a severe or profound hearing loss. Two conventional devices which provide communication for the deaf are the teletypewriter and the video telephone. While the teletypewriter is slow and tedious, the more attractive video telephone faces the problem of limited communication channel bandwidth. There are two basic approaches to the problem of communication between those with normal hearing and the deaf. The first involves direct electrical stimulation of the auditory system. Cochlear implants directly stimulate the auditory nerve, bypassing the non-functional hair cells. This technique requires some intact auditory nerve fibres, limiting the number of potential deaf candidates. The second approach is the use of sensory substitution methods, which can be used by virtually every deaf patient and are therefore more widely used for artificial hearing. Both approaches share many problems, among which are the issue of single versus multiple channels, dynamic range limitation, susceptibility to noise, and the requirement of extensive training for congenitally deaf patients. Because of these limitations, the deaf usually communicate with the hearing by means of writing or by using an interpreter.



Among the deaf themselves, sign language and finger spelling are the most common means of communication. A fluent signer can communicate more rapidly than a hearing person using spoken language, and can do so without missing the full range of expression [2]. Sign language is not simply English expressed with the hands rather than the voice. Each sign is differentiated from the others by hand shape, hand orientation, type of movement, and place of articulation. In American Sign Language (ASL), there are 45 distinct hand shapes, 10 hand orientations, 10 movements, and 25 locations. Because it requires no artificial hearing instrument and is efficient, the great majority of deaf adults choose sign language as their primary language. So far, very little research related to sign language has been initiated. Tartter and Knowlton carried out experiments on the perception of American Sign Language from an array of 27 moving spots strategically placed on the hands and face [2]. The results suggested the possibility of transmitting signs using the telephone line bandwidth. Sperling tested the minimum bandwidth requirement, approximately 20 kHz, for a simple raster-scan video transmission of ASL and finger spelling [3]. He later developed an intelligible encoding of ASL image sequences at extremely low information rates and found that the most effective code for grey-scale images is an analogue raster code requiring a bandwidth of 2880 Hz [4]. These results suggested that a great deal of human decision is applied in interpreting the signs. Kawai and Tamura developed a system which can recognize parts of speech and generate the corresponding animation-like Japanese sign in real time [5]. Tamura and Kawasaki developed an image processing system which could recognize 20 Japanese sign motions based on matching simple cheremes [6]. The results suggested the possibilities of using image processing techniques to recognize sign language.


Since sign language and finger spelling have been, and will remain, the most common means of communication among the deaf, and because it is impractical to use them to communicate with hearing people, a dedicated computer-based interpretation system which could convert signs to speech, and vice versa, would have great potential benefit. In this paper we present an image processing algorithm for the interpretation of ASL. The process involves detection of hand motion using differences of grey-scale intensity, tracking of the hand location using the approximation from the motion information, and classification of signs using adaptive clustering. Experiments are presented which test the performance of the algorithms.

CHARACTERISTICS OF SIGN LANGUAGES

Cheremic characteristics

In all sign languages, the equivalent of the word is the sign. Stokoe first recognized that every sign can be analysed into at least three components: (i) the place on the body where the sign is made, which is called tabular or 'tab'; (ii) the shape of the hand or hands, called designation or 'dez'; and (iii) the movement of the hand or hands, called signation or 'sig'. Changing a chereme of a sign alters the sign's meaning. There are approximately 6000 signs in ASL, which is a relatively small number compared with the 600,000 words of the English language. ASL has no written form and so must be referred to by the use of English as a close approximation. However, there are many ASL signs that are very difficult to translate into English.

Syntactic characteristics

Bellugi and Fischer (see Sperling [4]) have noted that signs generally take twice as long to produce as spoken words, but take about the same time to express ideas. It is speculated that there might be some 'presentation rate' at which the human mind best understands ideas communicated through languages. This makes it necessary for signers to be economical with time. This is done in ASL by omitting redundant syntactic markers and taking full advantage of the multidimensionality present in the visual and manual modalities. When the subject and object of a sentence are non-reversible, any sign order is permissible. For example, the translation of 'the man starts the car' could be: 'man start car'; 'car start man'; or 'start car man'. For other cases, ASL usually follows a subject-verb-object order.

INTERPRETATION OF ASL USING IMAGE MOTION ANALYSIS

Since human interpretation of sign language demands both spatio-temporal motion analysis and syntactic processing, the algorithms implemented in this study for parsing sign language are real-time motion detection using temporal intensity changes, hand location tracking, and classification of each sign using stop positions, the simple shape of the trajectory, and the hand shape at each stop position. More detailed definitions are given in the following sections.


The algorithms were implemented on an 8 MHz IBM PC AT compatible computer equipped with a PCVISION PLUS frame grabber board with a 512 by 480 by 8-bit resolution, which can grab the image from a camera or video tape recorder at the TV frame rate (30 frames per second). The programs, written in the programming language 'C', manipulated the hardware of the board and displayed the resulting images on a black and white monitor. This frame grabber has some very useful facilities, such as input look-up tables and TV synchronization status registers, which are necessary to make real-time processing possible. Since real-time processing of a standard TV image must run at a very high data rate (512 by 480 pixels by 30 frames per second, that is, about 7 Mbyte per second), and must also store the image at this speed, techniques that reduce the amount of data to be processed are required.

Detection of instantaneous motion

The motion of the hands can be detected in real time by comparing the grey-scale intensity of two consecutive image frames: if I(n, x, y) represents the intensity of image frame number n at position (x, y), the difference image at frame number n and the same coordinates, D(n, x, y), is defined as:

    D(n, x, y) = 1   if |I(n, x, y) - I(n - 1, x, y)| > threshold
               = 0   otherwise

Whether the signer is still or moving can be classified by counting the number of difference-image pixels which have the value 1 in each frame and comparing this count with another predefined threshold, expressed as a percentage of the overall number of pixels in the image.
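The following C fragment sketches this step. It is a minimal illustration, not the authors' original code: the frame dimensions match the grabber resolution quoted above, but the function names and threshold handling are assumptions.

    /* Minimal sketch of frame differencing and the moving/still decision. */
    #include <stdlib.h>

    #define W 512
    #define H 480

    /* Build the binary difference image D(n,x,y) from frames n and n-1. */
    static void difference_image(const unsigned char *cur,
                                 const unsigned char *prev,
                                 unsigned char *diff, int threshold)
    {
        for (long i = 0; i < (long)W * H; i++) {
            int d = (int)cur[i] - (int)prev[i];
            diff[i] = (abs(d) > threshold) ? 1 : 0;
        }
    }

    /* Classify the signer as moving when the fraction of changed pixels
     * exceeds a second, predefined threshold (given as a percentage). */
    static int signer_is_moving(const unsigned char *diff, double percent)
    {
        long count = 0;
        for (long i = 0; i < (long)W * H; i++)
            count += diff[i];
        return (100.0 * (double)count / ((double)W * H)) > percent;
    }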

Centroid calculation and hand position tracking

The coordinates of the centroid, or centre of gravity, of the difference picture at frame number n, x_c(n) and y_c(n), which represent a good approximation of the hand position, were computed from:

    x_c(n) = (1/N) Σ x    and    y_c(n) = (1/N) Σ y

where the sums run over all pixels (x, y) with D(n, x, y) = 1 and N is the total number of such pixels. Once the centroid of a difference image was calculated, the calculation of the next centroid was limited to a smaller area around the current hand position. This prevented the motion of other objects, such as the face and background scene, from introducing an error into the centroid calculation of the next image frame.
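Continuing the sketch above (and reusing its W, H, and difference-image conventions), a hypothetical routine for the windowed centroid might look as follows; the window half-width is an assumed parameter.

    /* Centroid of D(n,x,y)=1 pixels, restricted to a search window
     * around the previous hand position (cx, cy). Returns 0 if the
     * window contains no motion. */
    static int centroid_in_window(const unsigned char *diff,
                                  int cx, int cy, int half,
                                  double *xc, double *yc)
    {
        long sx = 0, sy = 0, n = 0;
        int x0 = cx - half < 0 ? 0 : cx - half;
        int y0 = cy - half < 0 ? 0 : cy - half;
        int x1 = cx + half >= W ? W - 1 : cx + half;
        int y1 = cy + half >= H ? H - 1 : cy + half;

        for (int y = y0; y <= y1; y++)
            for (int x = x0; x <= x1; x++)
                if (diff[(long)y * W + x]) {   /* D(n,x,y) == 1 */
                    sx += x;
                    sy += y;
                    n++;
                }
        if (n == 0)
            return 0;
        *xc = (double)sx / (double)n;   /* x_c(n) = (1/N) * sum of x */
        *yc = (double)sy / (double)n;   /* y_c(n) = (1/N) * sum of y */
        return 1;
    }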


In this study, we limited the hand motion to a simple type. The signer started each sign with the right hand moving from a predefined (neutral) position, stopping at two places with specific hand shapes, and returning to the neutral position. The program detected when the hand started moving, tracked and stored its position, and kept only the image area around the hand at the stop locations. This continued until the hand stopped at the neutral position. The trajectory used for classification was derived from the first stop and the second stop. Figure 1 shows the results when these techniques were applied to a test object. The rectangular object was placed on white paper and was moved upward and to the right. The program calculated the difference image shown in Figure 1b in real time using the superimposed image in Figure 1a and its movement. The centroid of the difference image was calculated and marked in Figure 1b. Figure 2 shows the trajectory of the object from its start position to the stop position, where we can see that the object has stopped.

Figure 1 A pictorial representation of the motion detection algorithm: a, the start position of the object; b, the difference image with the location of the initial and final centroid

Figure 2 The trajectory of the centroid of the object shown in Figure 1


Classification of each sign

The classification of each input sign was based on a sequence of three tests: stop positions; the simple shape of the trajectory; and the shape of the hand at the stop position. The search stopped when only one of the reference signs matched the input sign, or when all three tests had been applied. The reference parameters were updated based on the last five signs matched. For each sign, the parameters needed were the first and the second stop positions, the trajectory between these two positions, and the image of the hand at each location. The stop locations and eccentricities of five samples were also recorded. These parameters were defined as follows: x1(0) ... x1(4) are the five samples of the first-stop x coordinate; y1(0) ... y1(4) are the five samples of the first-stop y coordinate; x2(0) ... x2(4) are the five samples of the second-stop x coordinate; and y2(0) ... y2(4) are the five samples of the second-stop y coordinate.

The sample means were calculated from these five samples. The first-stop x sample mean and the first-stop y sample mean, x_c1 and y_c1, are defined as:

    x_c1 = (1/5) Σ_{i=0..4} x1(i)        y_c1 = (1/5) Σ_{i=0..4} y1(i)

And the second-stop x and y sample means, x_c2 and y_c2, are defined as:

    x_c2 = (1/5) Σ_{i=0..4} x2(i)        y_c2 = (1/5) Σ_{i=0..4} y2(i)

The variance of each sample mean was also calculated. The standard deviations of the first-stop x and y coordinates, σ_x1 and σ_y1, were calculated using:

    σ_x1 = [ (1/5) Σ_{i=0..4} x1(i)² - x_c1² ]^(1/2)

    σ_y1 = [ (1/5) Σ_{i=0..4} y1(i)² - y_c1² ]^(1/2)

Similarly, the standard deviations of the second-stop x and y coordinates, σ_x2 and σ_y2, were calculated using:

    σ_x2 = [ (1/5) Σ_{i=0..4} x2(i)² - x_c2² ]^(1/2)

    σ_y2 = [ (1/5) Σ_{i=0..4} y2(i)² - y_c2² ]^(1/2)
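As an illustration of the update step, the following C sketch recomputes a reference stop position and its standard deviations from the five stored samples, following the formulas above; the function and variable names are ours, not the authors'.

    #include <math.h>

    #define NSAMP 5   /* the last five matched instances of a sign */

    /* Recompute a stop position's mean and standard deviations. */
    static void update_reference(const double x[NSAMP], const double y[NSAMP],
                                 double *xc, double *yc,
                                 double *sx, double *sy)
    {
        double mx = 0.0, my = 0.0, mxx = 0.0, myy = 0.0;
        for (int i = 0; i < NSAMP; i++) {
            mx  += x[i];         my  += y[i];
            mxx += x[i] * x[i];  myy += y[i] * y[i];
        }
        mx /= NSAMP; my /= NSAMP; mxx /= NSAMP; myy /= NSAMP;
        *xc = mx;                   /* sample mean of the stop x coordinate */
        *yc = my;                   /* sample mean of the stop y coordinate */
        *sx = sqrt(mxx - mx * mx);  /* sigma_x = (mean of x^2 - xc^2)^(1/2) */
        *sy = sqrt(myy - my * my);  /* sigma_y = (mean of y^2 - yc^2)^(1/2) */
    }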


Using stop locations. The overall standard deviations at the first and second stop positions were used:

    σ1 = (σ_x1² + σ_y1²)^(1/2)    and    σ2 = (σ_x2² + σ_y2²)^(1/2)

The matching function based on the stop positions, MF_pos, was measured from r1 and r2, the Euclidean distances between the input stop positions, (x1, y1) and (x2, y2), and their references, (x_c1, y_c1) and (x_c2, y_c2):

    r1 = [ (x1 - x_c1)² + (y1 - y_c1)² ]^(1/2)

    r2 = [ (x2 - x_c2)² + (y2 - y_c2)² ]^(1/2)
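The exact expression for MF_pos has not survived reproduction in this copy, so the following C sketch assumes one plausible form: the sum of the two distances normalized by the corresponding overall standard deviations. It should be read as an illustration of the idea, not as the authors' definition.

    #include <math.h>

    static double euclid(double x, double y, double xr, double yr)
    {
        return sqrt((x - xr) * (x - xr) + (y - yr) * (y - yr));
    }

    /* Assumed matching score: smaller values mean better matches; a sign
     * becomes a candidate when its score falls below some acceptance
     * threshold (also an assumption). */
    static double stop_position_score(double x1, double y1,
                                      double x2, double y2,
                                      double xc1, double yc1, double sigma1,
                                      double xc2, double yc2, double sigma2)
    {
        double r1 = euclid(x1, y1, xc1, yc1);
        double r2 = euclid(x2, y2, xc2, yc2);
        return r1 / sigma1 + r2 / sigma2;
    }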


If there was only one reference that matched the input parameters with a high enough matching factor, the input sign was classified as being of that reference type. Otherwise, when there was no single best match and there were two or more candidates, the next test, the simple shape of the trajectory, was applied.

Using simple shapes of the trajectory. The eccentricity of a trajectory is defined as the ratio of the maximum distance of a point in the trajectory from the straight line connecting the first stop and the second stop, to the distance between the two stop locations. Figure 3 demonstrates the definition of the eccentricity used in this paper. When looking from the first stop to the second stop, an eccentricity larger than zero means that the point of maximum deviation is on the left; an eccentricity smaller than zero means that it is on the right. Using the eccentricity, the shape of a trajectory was classified as: 'Straight line' for an eccentricity between -0.10 and 0.10; 'Curve left' for an eccentricity greater than 0.10; and 'Curve right' for an eccentricity less than -0.10. If only one of the candidates found using the first test matched, the input sign was classified. Otherwise, the next test, using the shape of the hand at the stop positions, was applied.
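A hypothetical C routine for the eccentricity D/H of a stored trajectory is sketched below; the sign convention for 'left' and 'right' depends on the image coordinate system, which is an assumption here.

    #include <math.h>

    /* Eccentricity D/H: H is the distance between the two stops and D
     * the largest signed perpendicular deviation of the trajectory from
     * the line joining them. Assumes the two stops are distinct (H > 0). */
    static double eccentricity(const double *tx, const double *ty, int npts,
                               double x1, double y1, double x2, double y2)
    {
        double hx = x2 - x1, hy = y2 - y1;
        double h = sqrt(hx * hx + hy * hy);
        double dmax = 0.0;

        for (int i = 0; i < npts; i++) {
            /* signed perpendicular distance via the 2-D cross product;
             * the sign flips if the y axis points downward */
            double d = (hx * (ty[i] - y1) - hy * (tx[i] - x1)) / h;
            if (fabs(d) > fabs(dmax))
                dmax = d;
        }
        return dmax / h;   /* then classified as straight line or curve */
    }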

Using hand shapes at stop positions. When the trajectory shapes were not enough to classify a match, the hand shape at the last stop of each sign was used. The real-time hand-tracking routine kept the approximate centroid of the hand position and so stored the small image area around the hand. The hand shape at the last stop was classified using the generalized Hough transform technique described by Ballard [7].

EXPERIMENTATION

To test the classification performance of these techniques, 31 ASL signs made by the first author were used. This set is very small compared with the set of approximately 6000 signs, but should be enough for demonstration. The C program was written for an 8 MHz IBM AT microcomputer equipped with a PCVISION PLUS image processing board to perform hand tracking as described earlier. A 256 by 240 image size was used, and the coordinate system is shown in Figure 4. The trajectory, its eccentricity, and the hand images of each sign were saved. Table 1 shows the recorded stop positions and the eccentricities of these signs. Some trajectories are shown in Figure 5. Notice that the spacing between two centroids of each trajectory varies inversely with the instantaneous spatial velocity of the hand; it therefore gets smaller as the signer moves toward the second stop position. Because of the difficulty and the reliability of the real-time tracking algorithm when implemented on a general purpose computer, only one record of the trajectory was measured for each sign and used as reference.


Figure 3 The eccentricity of the trajectory as defined by D/H. H represents the linear distance between the start and stop positions, and D is the largest distance the trajectory is away from the straight line between start and stop positions


Figure 4 The coordinate system used in testing the sign language computer vision system

Table 1 The stop positions and eccentricities of 31 ASL signs

Sign          First position   Second position   Eccentricity
              x      y         x      y
Afternoon     98     174       164    53         0.079
Again         76     183       155    96         0.321
And           42     104       153    163        -0.103
Bad           113    139       36     37         0.179
Cannot        79     171       119    85         -0.021
Feel          98     63        80     165        0.063
Good          118    137       164    81         -0.040
Help          100    59        90     130        0.049
Home          87     142       80     99         -0.125
Hungry        114    121       117    81         -0.035
Learn         157    70        76     176        -0.020
Morning       84     103       85     167        0.016
No            55     173       45     120        -0.048
Please        106    43        134    90         0.013
Right         123    167       158    51         0.057
See           114    147       168    93         0.037
Show          138    126       199    85         -0.167
Sleep         173    120       193    66         -0.065
Slow          159    34        102    128        -0.026
Talk          78     112       10     156        0.009
Tell          133    131       170    55         -0.095
Thank you     100    145       80     43         -0.026
Time          113    114       150    40         -0.157
Today         79     159       178    58         0.051
Understand    75     186       56     167        -0.132
We            78     141       147    115        -0.091
What          156    110       164    27         -0.014
Where         78     149       20     87         -0.096
Wrong         119    117       108    147        0.044
Yes           ?      146       75     87         0.056
Yesterday     88     151       59     177        0.038

Figure 5 The trajectories of a number of common signs: (a) 'good', (b) 'see', (c) 'tell', (d) 'feel' and (e) 'morning'. It is noted that (a)-(c) are quite similar to one another, as are (d) and (e).

A program was written to simulate the real situation by generating random offsets from the stop positions, so that the mean of the stop positions and the standard deviations of each sign could be computed and used for the adaptive classification explained earlier. The initial standard deviation for each stop position was set to approximately 3. The machine performance was tested by generating a simulated input pattern for each sign, adding an offset within -12 to +12 pixels to its stop positions along each of the x and y axes.
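The simulation step can be illustrated with the short C sketch below; the sign_ref structure and the use of rand() are our assumptions, chosen only to show the +/-12 pixel perturbation.

    #include <stdlib.h>

    static int offset12(void)          /* uniform offset in [-12, +12] */
    {
        return (rand() % 25) - 12;
    }

    struct sign_ref {
        double x1, y1, x2, y2;         /* reference stop positions */
    };

    /* Create a simulated input pattern from a reference sign. */
    static struct sign_ref simulate_input(const struct sign_ref *ref)
    {
        struct sign_ref in = *ref;
        in.x1 += offset12(); in.y1 += offset12();
        in.x2 += offset12(); in.y2 += offset12();
        return in;
    }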

The adaptive position matching explained earlier was then applied and, as can be seen in Table 2, 22 of the 31 signs could be classified using only this test. Figure 5 shows sample trajectories of two signs: 'good', which has the candidates 'see' and 'tell', and 'feel', which has the candidate 'morning'. Figure 6 shows the signs 'good', 'see' and 'tell', which have similar stop positions and trajectories. The next test, using the eccentricity, can classify 5 signs out of the 9 groups of candidates left unclassified after the first test. For example, the first test gives 'thank you' as a candidate for the sign 'bad', but its eccentricity is 'Curve right' while that of the sign 'bad' is 'Straight line', so the latter was selected. From the table, the signs left unclassified after the second test are: (i) 'feel' with candidate 'morning'; (ii) 'good' with candidates 'see' and 'tell'; (iii) 'morning' with candidate 'feel'; and (iv) 'see' with candidate 'good'. Cases (i) and (iii) mean that the third test must be used to classify the hand shapes of the signs 'feel' and 'morning'; similarly, cases (ii) and (iv) mean that the hand shapes must distinguish the signs 'good', 'see', and 'tell'. Figure 7 shows the edge detection of the hand at the second stop position for the signs 'good', 'see', 'tell', 'feel', and 'morning' respectively. It can be seen that while they have similar trajectories and stop positions, the hand shapes are different. A program was written to perform a generalized Hough transformation on these hand shapes, and the R-tables, one for each shape, were created and saved as references. Another set of the five hand shapes in Figure 7 was then taken and used as test inputs.


Table 2 Results of applying the algorithms with simulated data

Sign         Eccentricity   First test:                      Second test:   Third test:
                            candidate (eccentricity)         candidate(s)   candidate(s)
Afternoon    0.079          None
Again        0.321          None
And          -0.103         None
Bad          0.179          Thank you (-0.026)               None
Cannot       -0.021         None
Feel         0.063          Morning (0.016), Help (0.049)    Morning        None
Good         -0.040         See (0.037), Tell (-0.095)       See, Tell      None
Help         0.049          None
Home         -0.125         Yes (0.056)                      None
Hungry       -0.035         None
Learn        -0.020         None
Morning      0.016          Feel (0.063)                     Feel           None
No           -0.048         None
Please       0.013          None
Right        0.057          None
See          0.037          Good (-0.040)                    Good           None
Show         -0.167         None
Sleep        -0.065         None
Slow         -0.026         None
Talk         0.009          None
Tell         -0.095         None
Thank you    -0.026         None
Time         -0.157         None
Today        0.051          None
Understand   -0.132         Yesterday (0.038)                None
We           -0.091         None
What         -0.014         None
Where        -0.096         None
Wrong        0.044          None
Yes          0.056          Home (-0.125)                    None
Yesterday    0.038          Understand (-0.132)              None

Figure 6 Signs which have similar trajectories: a, 'good'; b, 'see'; and c, 'tell'

Figure 7 Edge detected images of signs: a, 'good'; b, 'see'; c, 'tell'; d, 'feel'; and e, 'morning', as obtained using the generalized Hough transform

Table 3 The matched number from using the generalized Hough transform with signs 'good', 'see', and 'tell'

             Input sign
Reference    Good   See    Tell
Good         60     46     40
See          37     78     57
Tell         35     68     65

This allowed testing of the matching algorithms with inexact input shapes. The generalized Hough transform matching technique took into account the rotation, but not the size, of the image. The results in Table 3 and Table 4 show the number of matches obtained for the group 'good', 'see', 'tell' and the group 'feel', 'morning', respectively. We can see that the input sign 'see' matched the reference 'see' best (78), rather than the other signs, 'good' and 'tell', whose matched numbers were 46 and 68 respectively. The same conclusion was reached for the other signs in both tables, and so each sign in both groups can be differentiated.
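For readers unfamiliar with the technique, the following C sketch shows the translation-only core of R-table matching in the spirit of Ballard [7]; the paper's implementation also accounted for rotation, which is omitted here for brevity, and all structure sizes and names are assumptions.

    #include <string.h>

    #define NBINS 36      /* gradient orientation bins */
    #define MAXOFF 64     /* assumed capacity of offsets per bin */

    struct rtable {
        int n[NBINS];
        int dx[NBINS][MAXOFF], dy[NBINS][MAXOFF];
    };

    /* Build the R-table from reference edge pixels (ex, ey), each with a
     * quantized gradient orientation ebin in [0, NBINS). */
    static void rtable_build(struct rtable *rt, const int *ex, const int *ey,
                             const int *ebin, int ne, int refx, int refy)
    {
        memset(rt, 0, sizeof *rt);
        for (int i = 0; i < ne; i++) {
            int b = ebin[i];
            if (rt->n[b] < MAXOFF) {
                rt->dx[b][rt->n[b]] = refx - ex[i];
                rt->dy[b][rt->n[b]] = refy - ey[i];
                rt->n[b]++;
            }
        }
    }

    /* Vote over the test edges; the peak accumulator value plays the role
     * of the 'matched number' reported in Tables 3 and 4. */
    static int rtable_match(const struct rtable *rt, const int *ex,
                            const int *ey, const int *ebin, int ne,
                            int w, int h, int *acc)
    {
        int best = 0;
        memset(acc, 0, (size_t)w * h * sizeof *acc);
        for (int i = 0; i < ne; i++) {
            int b = ebin[i];
            for (int j = 0; j < rt->n[b]; j++) {
                int cx = ex[i] + rt->dx[b][j];
                int cy = ey[i] + rt->dy[b][j];
                if (cx >= 0 && cx < w && cy >= 0 && cy < h) {
                    int v = ++acc[cy * w + cx];
                    if (v > best) best = v;
                }
            }
        }
        return best;
    }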


Table 4 The matched number from using the generalized Hough transform with signs 'feel' and 'morning'

             Input sign
Reference    Feel   Morning
Feel         124    33
Morning      28     134

However, in real applications the input pattern may be very different from the reference, and further adaptive algorithms which can adjust their reference patterns automatically will be necessary.

CONCLUSIONS

The proposed techniques for interpreting ASL signs successfully classified the test sample of 31 signs. Only one of the three tests was necessary to classify 22 signs. The second test, applying the simple shape of the trajectory, classified 5 more of the remaining 9 signs. The rest of the signs could also be interpreted using the last test, the generalized Hough transform matching test. Because the first two tests are fast, taking approximately one second, and can classify 27 of the 31 signs, real-time interpretation of signs appears to be possible.


However, the variations caused by different signers, the lighting environment, and a much larger vocabulary might affect the accuracy of the algorithm. The next study will deal with a much larger sign space and the variations of each sign due to individual signers. An intelligent language-syntax analyser may be applied as a further test to help the system deal with these factors.

REFERENCES

1. Metro Service for the Deaf, Suite 101, 1657 Barrington St., Halifax, Nova Scotia. Sign Language.
2. Tartter VC, Knowlton KC. Perception of sign language from an array of 27 moving spots. Nature 1981; 289: 676-78.
3. Sperling G. Bandwidth requirements for video transmission of American Sign Language and finger spelling. Science 1980; 210: 797-99.
4. Sperling G, Landy M, Cohen Y, Pavel M. Intelligible encoding of ASL image sequences at extremely low information rates. Computer Vision, Graphics, and Image Processing 1985; 31: 335-91.
5. Kawai H, Tamura S. Deaf-and-mute sign language generation system. Pattern Recognition 1985; 18(3/4): 199-205.
6. Tamura S, Kawasaki S. Recognition of sign language motion images. Pattern Recognition 1988; 21: 343-53.
7. Ballard DH. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 1981; 13: 111-22.

