Graphical Evaluation of Vocal Fold Vibratory Patterns by High-Speed Videolaryngoscopy

*Alan P. Pinheiro, †Maria Eugênia Dajer, †Adriana Hachiya, ‡Arlindo N. Montagnoli, and †Domingos Tsuji, *Patos de Minas, Minas Gerais, †São Paulo, and ‡São Carlos, Brazil

Summary:

Objective. To characterize the voice and vocal fold function of an individual, it is essential to evaluate vocal fold vibration. The most widely used method for this purpose has been videolaryngoscopy.

Methods. This article proposes a digital image processing algorithm to estimate the glottal area (ie, the space between the vocal folds) and produce graphs of the opening and closing phases of the glottal cycle. In eight subjects without voice disorders, vocal fold movements were recorded by high-speed videolaryngoscopy at 4000 frames per second. The video data were processed by a combination of image segmentation techniques that estimate the glottal area. The segmented area was used to construct the glottal waveform.

Results. The graphs revealed important properties of vocal fold vibration, including amplitude, velocity, and other characteristics that have a major influence on voice quality.

Conclusions. The combination of high-speed technology with the proposed method improves vocal fold analysis by providing numerical feedback through a graphical representation of the real vibratory patterns of the folds.

Key Words: Glottal area–Segmentation–Videolaryngoscopy–High-speed imaging.

INTRODUCTION
During phonation, the vocal folds produce self-sustained oscillation that models the airflow from the lungs, generating a primary signal that is used as a source of excitation of the vocal tract. This signal is known as the glottal flow and determines some of the basic properties of the voice, including its fundamental frequency and main spectral components, which are modeled in the vocal tract1 and allow the production of different sounds. Typically, many voice pathologies are caused by changes in the structure or physiology of the vocal folds, and such changes are reflected in the glottal flow waveform.2 Therefore, it is important to develop methods for studying the vocal folds. Various methods have been proposed to analyze the vocal folds and the glottal flow waveform.
Such methods include inverse filtering of the signals,3 electroglottography,4 and videolaryngoscopy.5 In clinical and research settings, the gold standard for the analysis of the vocal folds and their vibratory motion is videolaryngoscopy.6 However, videolaryngoscopy is generally performed with cameras that can record no more than 20–30 frames per second (fps). Because the vocal folds vibrate at a mean frequency of 100–400 cycles per second, it is impossible to record their real vibratory motion by means of conventional laryngoscopy. Therefore, traditional videolaryngoscopy is used in an attempt to record some samples of phases of the vibratory cycle of the folds and create an artificial cycle using stroboscopic techniques. Because videolaryngoscopy provides images that lack detail, it is most frequently used exclusively for qualitative evaluations.

The advent of high-speed cameras and the possibility of using such cameras in conjunction with videolaryngoscopy opened new avenues for clinical and scientific research on vocal function. High-speed videolaryngoscopy provides detailed images of the entire cycle of vocal fold vibration. As an immediate result of this new technology, various studies have been conducted to quantify physiological vocal fold parameters,7–9 investigate the aerodynamic transfer of energy from glottal airflow to vocal fold tissue during phonation,10 determine the mechanical properties of vocal fold tissues,11 and thoroughly evaluate the effects of specific voice disorders.2 Studies comparing conventional with high-speed videolaryngoscopy have confirmed the advantages of the latter over the former.5,6

High-speed imaging techniques can provide a large quantity of images (approximately 2000–4000 fps). Because manual analysis of such images would be tedious, new digital image processing methods are required. In addition, high-speed videolaryngoscopy produces low-resolution images affected by noise inherent to the imaging method, as well as uneven illumination. Studies aimed at developing new algorithms for processing such images have, therefore, been conducted.12–14 However, most of the methods presented in those studies showed little or no automaticity, having been used to analyze excised vocal fold images obtained under ideal laboratory conditions and being dependent on good image resolution.

The objective of the present study was to develop a computational method for calculating (segmenting) the glottal area (ie, the space between the vocal folds) using images acquired by high-speed videolaryngoscopy. By calculating the glottal area, it is possible to estimate the glottal flow waveform that excites the vocal tract and characterize the opening and closing of the vocal folds.

Accepted for publication July 30, 2013. The present study received financial support from the São Paulo Research Foundation (FAPESP; grant no. 2010/18488). From the *Nucleus of Scientific and Technological Development, Faculty of Electrical Engineering, Federal University of Uberlândia, Patos de Minas, Brazil; †School of Medicine, University of São Paulo, São Paulo, Brazil; and the ‡Department of Electrical Engineering, Federal University of São Carlos, São Carlos, Brazil. Address correspondence and reprint requests to Alan P. Pinheiro, Universidade Federal de Uberlândia, Av Getúlio Vargas 230, Patos de Minas, Minas Gerais, CEP 38.700-128, Brazil. E-mail: [email protected]. Journal of Voice, Vol. 28, No. 1, pp. 106-111. © 2014 The Voice Foundation. http://dx.doi.org/10.1016/j.jvoice.2013.07.014
The present study focused on the application of the method in the evaluation of in vivo clinical images, which often contain artifacts. In addition, the performance and sensitivity of the method were tested to determine its validity and robustness.


METHODS

Data collection
Eight individuals (four males and four females) with healthy voices were invited to participate in the present study. The mean age was 42 years. All participants underwent videolaryngoscopy with a high-speed camera (ENDOCAM 5262; Richard Wolf, Knittlingen, Germany) capable of recording 4000 fps for a duration of 2 seconds. The images had a resolution of 256 × 256 pixels. Figure 1 shows one of the study participants undergoing high-speed videolaryngoscopy. During the examination, all participants were asked to sustain the vowel /a/. The study procedures were approved by the Research Ethics Committee of the University of São Paulo, School of Medicine (protocol no. 0767/09).

Glottal area segmentation
The algorithm developed to estimate (ie, segment) the glottal area consists of a sequence of steps applied to each frame. Figure 2 shows a block diagram illustrating the procedure. Unlike other methods,12–14 which aim to segment the glottal area directly in a few computational steps, the method developed in the present study obtains an initial rough estimate of the location of the glottal area and refines that estimate in subsequent steps until the entire space between the right and left vocal folds has been defined as accurately as possible.

The first step of the method was designated initialization (Figure 2A), in which the first frame of the high-speed video is displayed. In this step, the user selects n points (n ≥ 1) in the glottic region with the computer mouse. The selected points are used to estimate the mean color intensity m in the glottic region, as shown in Equation (1), where i and j indicate the image coordinates of the selected points. This is the only manual step of the method, performed only once per examination.

\[ m = \frac{1}{n}\sum_{x=1}^{n} M(i_x, j_x) \qquad (1) \]
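As a concrete illustration, the averaging in Equation (1) amounts to a few lines of code. The sketch below is hypothetical (the function and variable names are not from the article) and represents a frame as a plain nested list of gray levels:

```python
def mean_seed_intensity(image, points):
    """Mean gray level m over the user-selected glottal points (Equation 1).

    image  -- 2-D gray-level frame as a list of lists
    points -- list of (i, j) coordinates clicked inside the glottis
    """
    return sum(image[i][j] for (i, j) in points) / len(points)

# Toy 4x4 frame: a dark glottis (low values) surrounded by bright tissue.
frame = [
    [200, 200, 210, 205],
    [198,  30,  40, 202],
    [199,  35,  25, 201],
    [204, 203, 207, 206],
]
m = mean_seed_intensity(frame, [(1, 1), (2, 2)])
print(m)  # (30 + 25) / 2 = 27.5
```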

The second step of the method was designated binarization (or thresholding), in which the glottic region is separated from the remaining regions of the image on the basis of a threshold, calculated by Equation (2):

\[ T = m \pm s(i, j), \qquad s(i, j) = \sqrt{\frac{1}{x \cdot y}\sum_{i=1}^{x}\sum_{j=1}^{y}\left(M[i, j] - m\right)^{2}} \qquad (2) \]
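The local adaptive thresholding of Equation (2) can be sketched as follows. This is one plausible reading of the formula, in which the deviation s(i, j) is computed over a small window around each pixel relative to the seed mean m; the window size and the function names are assumptions, not the article's implementation:

```python
import math

def binarize(image, m, radius=1):
    """Local adaptive thresholding (Equation 2): a pixel is marked glottal
    (True, ie, white) when its gray level is below T = m + s(i, j), where
    s(i, j) is the deviation of its neighborhood around the seed mean m."""
    h, w = len(image), len(image[0])
    out = [[False] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Gather the (2*radius+1)^2 neighborhood, clipped at the border.
            vals = [image[a][b]
                    for a in range(max(0, i - radius), min(h, i + radius + 1))
                    for b in range(max(0, j - radius), min(w, j + radius + 1))]
            s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            out[i][j] = image[i][j] < m + s
    return out

# A dark pixel inside bright tissue is selected; the bright tissue is not.
frame = [[200, 200, 200],
         [200,  30, 200],
         [200, 200, 200]]
mask = binarize(frame, m=30)
print(sum(v for row in mask for v in row))  # only the dark glottal pixel
```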

In Equation (2), T indicates the threshold; m the mean color intensity in the glottic region calculated in Equation (1); and s the standard deviation of the surroundings of each pixel in the image to be binarized. The method uses a technique known as local adaptive thresholding, which is based on the statistical relationship between each pixel and its surroundings. The algorithm uses the mean color intensity in the glottic region, as defined by the points selected manually in the first step, and the standard deviation of the surroundings of each pixel in the image to be binarized. If the color intensity of the pixel at a given coordinate is lower than the threshold, the pixel is displayed in white, whereas the remaining (ie, unselected) pixels are displayed in black. Therefore, the regions of the image whose color is similar to that of the glottal area defined in the initial step are selected, the remaining regions being eliminated from the image.

FIGURE 1. Videolaryngoscopy with a high-speed camera.

Preliminary analyses revealed that most of the glottis was identified and displayed in white (Figure 2B). However, due to lighting problems and low contrast, certain regions outside the glottis were selected because their color was similar to that of the glottal area. Those selections constitute typical noise in the image.

The third step of the method was designated seed estimation (Figure 2C), in which the image binarized in the previous step is filtered by means of a traditional method of image erosion and dilation,15 thereby eliminating small noise artifacts (seen in white in Figure 2B) and irrelevant details. After the image has been filtered, a blob algorithm16 is applied. The algorithm identifies the largest segmented element (named blob) in the binary image and classifies it as being within the target area (here, the glottal area). Finally, the centroid of that blob is estimated, and its coordinates are used as a "seed" in the next step. The seed estimation process automatically estimates, with some degree of certainty, a single coordinate within the glottic region on the basis of the binary image. As previously mentioned, the user manually selects points in the glottic region only on the first frame of the video; for the remaining frames, the estimate is provided automatically by seed estimation.
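The blob-and-centroid seed estimation can be sketched with a simple connected-component search. The code below is a hypothetical stand-in for the blob algorithm cited in the article: it finds the largest 4-connected white component by breadth-first search and returns None when the binary image contains no white pixels (the closed-glottis case):

```python
from collections import deque

def largest_blob_centroid(binary):
    """Seed estimation: locate the largest 4-connected white component in a
    binary image and return its centroid (i, j), or None if none exists."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if binary[i][j] and not seen[i][j]:
                # Breadth-first search to collect one connected component.
                blob, queue = [], deque([(i, j)])
                seen[i][j] = True
                while queue:
                    a, b = queue.popleft()
                    blob.append((a, b))
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        na, nb = a + da, b + db
                        if (0 <= na < h and 0 <= nb < w
                                and binary[na][nb] and not seen[na][nb]):
                            seen[na][nb] = True
                            queue.append((na, nb))
                if len(blob) > len(best):
                    best = blob
    if not best:
        return None  # no white pixels: area classified as null
    ci = sum(p[0] for p in best) / len(best)
    cj = sum(p[1] for p in best) / len(best)
    return (ci, cj)

# A 2x2 blob dominates a lone noise pixel; its centroid becomes the seed.
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]]
print(largest_blob_centroid(mask))  # (0.5, 0.5)
```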
In cases in which the glottal area is small or absent (eg, when the vocal folds are closed), the image erosion process eliminates all image elements and no seed is defined, the area therefore being classified as null.

After seed estimation, the region growing step is initiated (Figure 2D). The technique used in the present study expands the seed iteratively (generating a segmented area) by analyzing all pixels that surround the boundary of this segmented area in accordance with a homogeneity criterion that evaluates the difference between the intensity value of the surrounding pixels and the mean intensity of the segmented area.17 If that


FIGURE 2. Block diagram illustrating the steps of the algorithm developed to calculate the glottal area (space between the vocal folds).

difference is less than a given percentage (established by the user in the initialization step of the method), the pixel showing the smallest difference is incorporated into the segmented area. This process is repeated until the difference between the mean intensity of the segmented region and the intensity of the surrounding pixels exceeds the percentage threshold set by the user as the homogeneity criterion. In the present study, the mean tolerance was set at 15%: pixels showing a difference in color of up to 15% in relation to the segmented region are incorporated into it, and the process is repeated until no more pixels meet the homogeneity criterion. At the end of the region growing process, the area defined should correspond to the glottal area, as shown in Figure 2D.

The shape of the calculated glottal area may be irregular because of image artifacts or imperfections in the algorithm (Figure 2D). Therefore, in the last step of the method, a morphologic operation known as closing15 is applied to the segmented area to minimize such errors, producing a smoother image (Figure 2E). All the steps described above are repeated for each of the high-speed video frames until all frames have been processed.

Performance evaluation
To evaluate the performance of the segmentation method, two high-speed videos that were not part of the study sample were used. Both videos underwent the segmentation process described in the present study. They were also segmented manually by an experienced user, who delineated the glottal area on each frame with a computer mouse. For each video, only the initial frames (approximately 200 frames) were segmented. The results of the two types of segmentation (automated and manual) were compared.
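The region-growing step of the segmentation algorithm (Figure 2D) can be sketched as follows. This is a minimal illustration, not the article's implementation: the 15% homogeneity tolerance is applied here against the full 8-bit intensity range, which is an assumption, and the function names are hypothetical:

```python
def region_grow(image, seed, tolerance=0.15):
    """Grow a region from the seed pixel: neighbors are absorbed while their
    gray level differs from the current region mean by no more than
    `tolerance` of the 8-bit intensity range (an assumed reading of the
    article's 15% homogeneity criterion)."""
    h, w = len(image), len(image[0])
    region = {seed}
    total = image[seed[0]][seed[1]]
    grew = True
    while grew:
        grew = False
        mean = total / len(region)  # mean intensity of the segmented area
        for (i, j) in list(region):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if (0 <= ni < h and 0 <= nj < w and (ni, nj) not in region
                        and abs(image[ni][nj] - mean) <= tolerance * 255):
                    region.add((ni, nj))
                    total += image[ni][nj]
                    grew = True
    return region

# From a seed in the dark glottis, the region fills the dark 2x2 patch and
# stops at the bright surrounding tissue.
frame = [[200, 200, 200, 200],
         [200,  30,  40, 200],
         [200,  35,  25, 200],
         [200, 200, 200, 200]]
print(sorted(region_grow(frame, (1, 1))))
```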
Equation (3) was used to measure the percentage difference between the two curves:

\[ E = \frac{1}{n}\sum_{k=1}^{n}\frac{\left|A_{gm}(k) - A_{ge}(k)\right|}{A_{gep}} \times 100\%, \qquad (3) \]

where E is the relative error between the curves A_{gm} and A_{ge}; n is the number of points in the glottal area curve (ie, the number of segmented frames); A_{gm} is the glottal area estimated by the proposed method; A_{ge} is the glottal area manually estimated by the expert; and A_{gep} is the maximum peak value of A_{ge}, used to normalize the error.

To analyze the influence of image resolution on the results obtained with the proposed method, the authors performed a second analysis in which the image resolution was reduced for the two videos used to estimate method performance (from 256 × 256 pixels to 128 × 128 pixels). To aid in the comparison between the segmentation of those videos at the higher resolution and at the lower resolution, the graphs were normalized for direct comparison.

Graphical evaluation
After the number of pixels constituting the glottal area has been estimated on each frame, it is possible to estimate the glottal area waveform (GAW). Each GAW sample represents a processed image (or frame) and its respective glottal area value (Figure 3). Therefore, the GAW indicates the opening and closing of the vocal folds. The GAW is a well-known parameter in the technical literature because it is similar to the glottal flow waveform that excites the vocal tract.

RESULTS
Figure 4 shows the results of the performance evaluation of the algorithm for the two processed videos. Figure 4A and B, respectively, show the GAW estimated by the algorithm and by manual segmentation. The relative errors for the two curves, estimated by Equation (3), were 6.7% and 4.2%, respectively. The GAW calculated by the algorithm was found to have a smoother shape, which can be partially explained by the filtering process that the images underwent. In addition, delineation of the glottal area by the method proposed in the present study provides an objective numerical criterion. The largest discrepancies were detected at the peak of the waveform, where the area is largest.
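The error metric of Equation (3) is straightforward to compute. A brief sketch, with made-up curves (the numbers below are illustrative, not the article's data):

```python
def relative_error(ag_method, ag_expert):
    """Relative error between two glottal area curves (Equation 3),
    normalized by the peak value of the expert curve."""
    n = len(ag_method)
    peak = max(ag_expert)
    return sum(abs(a - b) for a, b in zip(ag_method, ag_expert)) / (n * peak) * 100.0

# Two short, slightly different area curves (in square pixels):
auto   = [0, 10, 50,  95, 50, 10, 0]
manual = [0, 12, 48, 100, 52,  8, 0]
print(round(relative_error(auto, manual), 2))  # 1.86 (%)
```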
The evaluation of the sensitivity of the algorithm (Figure 4C and D) revealed that the basic characteristics of the GAW estimated by the algorithm were preserved despite the image resolution having been reduced by half. The relative errors for the waveforms of the resized videos were 7.2% (between the signals of Figure 4C) and 4.5% (between the signals of Figure 4D).

FIGURE 3. Time series illustrating the variation in glottal area, frame by frame. This time series is commonly known as the GAW. Note the closed phase, opening phase, and closing phase of the glottal cycle.

Figure 5 shows the results of the process for some of the images obtained from the eight individuals analyzed, together with the GAW for each of those individuals. The segmented frames show that the glottal area as measured on the images coincided with the glottal area delineated by the edges of the vocal folds. The remaining frames showed similar results. The GAW shows the opening and closing phases of the vocal folds, the closing phase being slightly faster than the opening phase. This characteristic, which can only be seen on high-speed videos, is responsible for producing a glottal flow spectrum that is richer in harmonic components.1 The smooth shape of the GAW indicates that the waveform has few stochastic components, that is, less noise. In cases in which the images are highly degraded by noise or in which the segmentation process is ineffective, the number of stochastic components is noticeably larger.

The tests performed revealed no false classifications, the algorithm having successfully defined a seed for all frames on which there was a glottal area. The second frame shown in Figure 5C reveals an imperfection in the segmentation: only the anterior portion of the glottis was delineated, a small opening in the posterior portion (typically seen in female vocal folds at the beginning of the opening phase) having been disregarded. This can be attributed to the fact that the algorithm is based on the assumption that there is only one space between the vocal folds (rather than two, as was the case here), the smaller space being automatically discarded. However, the user can intervene and manually indicate, with a single mouse click, the area that was not processed by the algorithm. Despite this segmentation failure, the algorithm showed good automaticity and required little intervention.


DISCUSSION
The algorithm was capable of accurately depicting the cycle of vocal fold vibration, generating a GAW whose waveform is similar to the glottal flow and indicates many features of vocal fold vibration. In comparison with other methods described in the literature,12–14 the algorithm developed in the present study has the notable advantage of allowing the processing of in vivo clinical images with low contrast, uneven illumination, relative motion, and limited resolution. In addition, the eight samples processed required little manual intervention, the algorithm having allowed a high degree of automaticity in the analysis of the images. The algorithm was found to have few of the limitations of other segmentation algorithms. It was also found to be robust, in that it was not overly sensitive to color variations in the glottic region and was able to process low-contrast frames.
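Features of vocal fold vibration, such as the durations of the opening and closing phases, can be read directly off the GAW. A hypothetical sketch, assuming a single idealized cycle (from one closure to the next) sampled at the camera's 4000 fps:

```python
def phase_durations(gaw, fps=4000):
    """Durations (in ms) of the opening and closing phases of one glottal
    cycle, read off a single-cycle GAW sampled at `fps` frames per second.
    Assumes the cycle starts at the onset of opening and ends at closure."""
    peak = max(range(len(gaw)), key=gaw.__getitem__)  # index of maximum area
    opening = peak                    # frames from first sample to the peak
    closing = len(gaw) - 1 - peak     # frames from the peak back to closure
    return opening * 1000.0 / fps, closing * 1000.0 / fps

# Illustrative single cycle (area in square pixels); the closing phase is
# slightly faster than the opening phase, as observed in the recordings.
cycle = [0, 20, 60, 110, 150, 120, 60, 0]
print(phase_durations(cycle))  # (1.0, 0.75) ms
```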


FIGURE 4. Evaluation of the performance of the proposed algorithm. In (A) and (B), respectively, the GAW as segmented by the proposed method (dotted black curve) and as manually segmented by an expert (solid blue curve) for two different high-speed videos. In (C) and (D), respectively, segmentation of the video at the original resolution (256 × 256 pixels; dotted black curve) and at a lower resolution (128 × 128 pixels; solid blue curve). In (C) and (D), the curves were normalized for comparison. (For interpretation of the references to color in this figure legend, the reader is referred to the online version of this article.)

[Figure 5 panels: for each of subjects 1–8, the left column shows segmented frames and the right column shows the estimated glottal area waveform, with area (pixels²) on the vertical axis and time (0–30 ms) on the horizontal axis.]

FIGURE 5. Segmented vocal folds (left column). The white edges indicate the area that was segmented by the algorithm developed in the present study. The right column shows the GAW (in square pixels) as estimated by the method. The horizontal axis of the glottal area indicates time in milliseconds.

The use of a computational method for semiautomated segmentation of the glottal area allows a more accurate differentiation between the glottal area and the vocal fold edges on the basis of a fixed, numerical criterion, therefore reducing the subjectivity of the analysis and allowing the quantification of the images. In addition, because high-speed cameras provide a large number of images, manual segmentation is impractical.

One major difficulty in segmenting images obtained with a high-speed camera is the delineation of the vocal folds. This is because the vocal fold edges are rounded and the vocal fold mucosa creates ill-defined borders. In addition, uneven illumination results in an irregularly shaped glottal area, constituting an obstacle to the region dilation step of the algorithm. This difficulty motivated the use of a local segmentation strategy. In cases of highly degraded images, prior equalization can also contribute to the automaticity and precision of the method.

Finally, the proposed image processing algorithm naturally produces a GAW given in square pixels. The use of a secondary algorithm to convert pixels to world coordinates (ie, millimeters), like the approach proposed in reference 18, would improve the results by allowing a direct and quantitative comparison between patients. However, this article focuses only on the segmentation of the vocal folds and GAW estimation; future work may include a procedure to estimate absolute coordinates. Additionally, the application of the proposed method in clinical practice for studying pathologic subjects may also reveal new insights.

On the basis of the results of the present study, the authors suggest that the method proposed here is valid for segmenting the glottal area on the basis of images obtained by high-speed videolaryngoscopy. Graphical evaluation of the GAW is an important tool to assess the kinematics of vocal fold vibration over several glottal cycles.

REFERENCES
1. Story BH. An overview of the physiology, physics and modeling of the sound source for vowels. Acoust Sci Technol. 2002;2:195–206.
2. Yan Y, Damrose E, Bless D. Functional analysis of voice using simultaneous high-speed imaging and acoustic recordings. J Voice. 2007;21:604–616.
3. Alku P, Vilkman E, Laukkanen A-M. Estimation of amplitude features of the glottal flow by inverse filtering speech pressure signals. Speech Commun. 1998;24:123–132.
4. Roubeau B, Henrich N, Castellengo M. Laryngeal vibratory mechanisms: the notion of vocal register revisited. J Voice. 2009;23:425–438.

5. Olthoff A, Woywod C, Kruse E. Stroboscopy versus high-speed glottography: a comparative study. Laryngoscope. 2007;117:1123–1126.
6. Kendall KA. High-speed laryngeal imaging compared with videostroboscopy in healthy subjects. Arch Otolaryngol Head Neck Surg. 2009;135:274–281.
7. Yang A, Stingl M, Berry DA, Lohscheller J, Voigt D, Eysholdt U, Döllinger M. Computation of physiological human vocal fold parameters by mathematical optimization of a biomechanical model. J Acoust Soc Am. 2011;130:949–964.
8. Pinheiro AP, Stewart DE, Pereira JC, Oliveira S. Analysis of nonlinear dynamics of vocal folds using high-speed video observation and biomechanical modeling. Digit Signal Process. 2012;22:304–313.
9. Tao C, Zhang Y, Jiang JJ. Extracting physiologically relevant parameters of vocal folds from high-speed video image series. IEEE Trans Biomed Eng. 2007;54:794–801.
10. Thomson SL, Mongeau L, Frankel SH. Aerodynamic transfer of energy to the vocal folds. J Acoust Soc Am. 2005;118:1689–1700.
11. Jiang JJ, Zhang Y, McGilligan C. Chaos in voice, from modeling to measurement. J Voice. 2006;20:2–17.
12. Lohscheller J, Toy H, Rosanowski F, Eysholdt U, Döllinger M. Clinically evaluated procedure for the reconstruction of vocal fold vibrations from endoscopic digital high-speed videos. Med Image Anal. 2007;11:400–413.
13. Yan Y, Chen X, Bless D. Automatic tracing of vocal-fold motion from high-speed digital images. IEEE Trans Biomed Eng. 2006;53:1394–1400.
14. Zhang Y, Bieging E, Henry T, Jiang JJ. Efficient and effective extraction of vocal fold vibratory patterns from high-speed digital imaging. J Voice. 2010;24:21–29.
15. Gonzalez RC, Woods RE. Digital Image Processing. New York, NY: Prentice Hall; 2002.
16. Lindeberg T. Detecting salient blob-like image structures and their scales with a scale-space primal sketch: a method for focus-of-attention. Int J Comput Vis. 1993;11:283–318.
17. Wood SA, Hoford JD, Hoffman EA, Zerhouni E, Mitzner W. A method for measurement of cross sectional area, segment length, and branching angle of airway tree structures in situ. Comput Med Imag Graph. 1995;19:145–152.
18. Larsson H, Hertegård S. Calibration of high-speed imaging by laser triangulation. Logoped Phoniatr Vocol. 2004;29:154–161.
