INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN BIOMEDICAL ENGINEERING Int. J. Numer. Meth. Biomed. Engng. 2015; e02715 Published online 17 April 2015 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cnm.2715

Tracing vocal fold vibrations using level set segmentation method

Tailong Shi1, Hyun June Kim1, Thomas Murry2, Peak Woo3 and Yuling Yan1,*,†

1 Department of Bioengineering, Santa Clara University, Santa Clara, CA, USA
2 Department of Otorhinolaryngology, Cornell University, NY, USA
3 Department of Otolaryngology, The Mount Sinai Medical Center, NY, USA

SUMMARY
High-speed digital imaging (HSDI) of the larynx can provide important information on vocal fold kinematics. This information may provide a better understanding of the mechanism of phonation and assist the clinical assessment of voice disorders. Automatic tracing of the vocal fold vibration is a key step in the kinematic analysis and in the correlative characterization of vocal fold vibrations and voice quality in normal and diseased states. In this study, we introduce a new approach for image segmentation and automatic tracing of vocal fold motion that combines the level set method with a motion cue. The approach is applied to videokymogram (VKG)-form images, which are obtained from a sequence of laryngeal images captured using HSDI. To utilize the motion cue for a more effective level set based segmentation of the VKG, we first construct a so-called standard deviation (STD) image by mapping a pixel-based measure of temporal intensity dispersion computed from the initial HSDI sequence. The STD image maps the extent of vocal fold motion; after a threshold operation, a region of interest (ROI) that encloses the vocal fold motion, or glottal region, is identified. The performance and effectiveness of our approach are evaluated using clinical datasets representing both normal and pathological voice conditions.

Copyright © 2015 John Wiley & Sons, Ltd.

Received 31 May 2014; Revised 08 March 2015; Accepted 11 March 2015

KEY WORDS: high-speed laryngeal imaging; videokymogram; level set image segmentation; vocal fold motion; region of interest

*Correspondence to: Yuling Yan, Department of Bioengineering, Santa Clara University, Santa Clara, CA, USA.
†E-mail: [email protected]

1. INTRODUCTION
High-speed digital imaging (HSDI) of the larynx, also referred to as high-speed laryngeal imaging or high-speed videoendoscopy, has now become a clinical reality. HSDI systems typically record images of the vibrating vocal folds at 2000–4000 frames per second, fast enough to resolve a specific, sustained phonatory vocal fold oscillation (100–400 Hz in speaking voice) [1]. Vocal fold vibration is the key dynamic event responsible for voice production, and the vibratory characteristics of the vocal folds correlate closely with voice quality and voice condition. However, credible analysis and characterization of the vocal fold vibration require accurate and effective extraction of the vocal fold displacements from the sequence of glottis images captured by the HSDI.

One way to extract vocal fold motion is to segment the glottis image frame by frame using threshold, region-growing [2], snake-based, active contour, or watershed transform-based methods [3–6]. After the glottis contours are delineated by this sequential segmentation process, both the glottal area waveform and the vocal fold displacements at specific locations can be obtained. Videokymography is an alternative laryngeal imaging modality [7] in which each vibratory glottis cycle is documented as a sequence of several images; these can be acquired directly from a single-line camera or extracted from a sequence of images captured by the HSDI system [8].

The level set segmentation method has recently been applied to glottis segmentation on a frame-by-frame basis [9, 10]. However, the complexity of the level set algorithm and its time-consuming implementation become an issue when the method is applied to the thousands of image frames contained in an HSDI recording. In this paper, we use single images in the form of a videokymogram (VKG), constructed from a sequence of HSDI images, for viewing glottal cycles and local vocal fold motions. The level set segmentation method is applied to an individual VKG image to extract the bilateral vocal fold displacements at specific locations.

2. METHODOLOGY
When applying the level set method to the segmentation of a VKG image, we need to address two issues: (1) where to set the initial contour for the level set iteration, and (2) how to eliminate the interference of the background (outside of the glottal region). To address these issues, we introduce the so-called standard deviation (STD) image, derived from the vocal fold motion cue. The STD image provides a map of the vocal fold motion from which a region of interest (ROI) can be identified. Subsequently, the segmentation can be performed on the VKG image restricted to the ROI. The proposed general approach is illustrated in the flow chart (Figure 1).

Figure 1. The flow chart of the proposed approach. HSDI, high-speed digital imaging; ROI, region of interest; STD, standard deviation; VKG, videokymogram.

2.1. VKG image
To generate the VKG, we select an image frame and a measuring line position (left, Figure 2) and then generate a composite image by stacking the selected line images from a sequence of successive HSDI frames along the vertical axis, which represents frame number or time; the time between two successive image frames is 0.5 ms (at an acquisition rate of 2000 frames per second). The resultant composite image, known as a videokymogram or VKG, is a unique, at-a-glance display of multiple glottis cycles (right, Figure 2). The VKG allows visualization, within a single image, of key features of vocal fold vibrations such as aperiodic vibration and left–right asymmetry.

Figure 2. The videokymogram (VKG) image extracted from the high-speed digital imaging (HSDI) dataset of a patient with right vocal fold sulcus vocalis. Left: single frame of the HSDI sequence; Right: the resultant VKG image. The white line on the left image indicates the scanning position for the VKG.

2.2. Glottis segmentation using level set method
Active contour models implemented via level set methods [11, 12] have been proposed to solve numerous image segmentation problems [9, 13–15]. The basic idea is that a contour C in a domain Ω can be represented by the zero level set of a higher-dimensional embedding function Φ : Ω → ℝ. Rather than directly evolving the contour C, one evolves the level set function Φ. The embedding function Φ is defined as the signed distance function (SDF) with Φ > 0 inside the shape, Φ < 0 outside the shape, and |∇Φ| = 1 almost everywhere. The evolution of the level set function Φ is governed by a partial differential equation (PDE). Consider the following curve evolution equation:

$$\frac{dC}{dt} = F\,\vec{N}, \tag{1}$$

which evolves the curve C according to the speed field F in the normal direction $\vec{N}$. This evolution is achieved by numerically solving the following PDE on a regular grid:

$$\frac{\partial \Phi}{\partial t} + F\,|\nabla \Phi| = 0. \tag{2}$$

In particular, an improved algorithm introduced by T. Chan and L. Vese [15] was applied to the VKG image for glottis contour detection. Because their model does not rely on an edge function to stop the evolving curve at the desired boundary, the algorithm can be applied to images with a noisy background, which is common with laryngeal images. Figure 3 (b) shows the VKG after contour detection using the level set method. Although the original VKG image shown in Figure 3 (a) contains interference, we still detected a reasonably accurate contour using the level set method. For comparison, the results obtained from the Otsu method [16] and the region-growing method are shown in Figure 3 (c and d). In this example, the level set method is the most effective and robust of the three against the noisy background, and the region-growing method fails to detect disconnected regions in the VKG.

However, two problems were encountered when the level set method was applied to more VKG images. One is the automatic selection of the initial contour position. Figure 4 shows the results of segmentation with different choices of initial contour position. Although the results are significantly improved after manually adjusting the position of the initial contour, as shown in the lower image (Figure 4 (b)), an automatic selection is desired, for example, always placing the initial contour at the center of the image. The other problem is the interference from the background. The example in Figure 4 (b) shows erroneous contour detection owing to the background. To address these two problems, we define an ROI and restrict the segmentation to it. The ROI is constructed using a so-called STD image, as introduced in the following section.
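For reference, the Chan–Vese model [15] used in this section is commonly stated as the minimization of a region-based energy. The notation below (u_0 for the image, c_1 and c_2 for the mean intensities inside and outside the contour, and the weights μ, ν, λ_1, λ_2) follows the usual presentation of that model and is an assumption here, since the paper does not spell it out:

```latex
F(c_1, c_2, C) = \mu \,\mathrm{Length}(C) + \nu \,\mathrm{Area}(\mathrm{inside}(C))
  + \lambda_1 \int_{\mathrm{inside}(C)} \lvert u_0(x, y) - c_1 \rvert^{2} \, dx \, dy
  + \lambda_2 \int_{\mathrm{outside}(C)} \lvert u_0(x, y) - c_2 \rvert^{2} \, dx \, dy
```

The two data terms depend only on how well each region is described by its mean intensity, not on the image gradient, which is why no edge-stopping function is required.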

Figure 3. Videokymogram images. (a) Before segmentation. (b) After the segmentation using level set method, 100 iterations. (c) After Otsu threshold segmentation. (d) After region-growing segmentation.

Figure 4. Results of segmentation using level set method. (a) Initial contour centered at the middle. (b) Manually adjusted initial contour.

2.3. Define region of interest using standard deviation image
In normal voicing, the vibrating vocal folds open and close quasi-periodically, causing a change in gray-scale brightness within the glottal region (the region of the vocal fold opening). In contrast, there is little change in gray-level values outside the glottal region. This motion cue can be used to improve the accuracy and efficiency of the VKG segmentation. The STD image is constructed from the intensity variation across sequential glottis images at corresponding spatial locations. The value Std(x, y) at each pixel location (x, y) of the STD image is calculated as follows:

$$\mathrm{Std}(x, y) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(I_i - \bar{I}\right)^{2}}, \tag{3}$$

where N is the number of image frames used to construct the STD image, I_i represents the pixel intensity value at location (x, y) in the i-th image of the sequence, and Ī is the average value of I_i. The value Std(x, y) thus indicates the extent of the intensity variation at each pixel location (x, y) over the glottis cycles. We used a sequence of 60 images (approximately 5 glottis cycles) to construct the STD image (displayed in pseudo color in Figure 5). This image is used to identify a region enclosing the motion of the vocal folds. A brighter pixel in the STD image corresponds to a greater value of Std(x, y) calculated from the intensity profile of that pixel. Figure 5 (b) shows the intensity profile of a bright pixel location labeled 1 in Figure 5 (a); in contrast, the dark pixel at the location labeled 2 has a smaller STD value, as calculated from its intensity profile shown in Figure 5 (c). We can observe that the bright region in the STD image represents the region of interest where the vocal fold motion occurs, while the darker region represents the background.

Figure 5. (a) Standard deviation (STD) image. (b) Intensity profile of point 1 in the STD image. (c) Intensity profile of point 2 in the STD image.

To define the boundary of the maximum glottal region, we first obtain a binary image by applying Otsu threshold segmentation to the STD image, followed by a morphological operation (an opening with a 12 × 12 square template) to remove isolated areas. The boundary of the remaining region of the binary image is then defined to serve as the ROI. The results are shown in Figure 6, where the outline of the region of maximum motion is delineated (Figure 6 (d)). Because the ROI encloses all or part of the glottis where vocal fold motion occurs, it can help solve the two problems mentioned previously.

Figure 6. (a) Standard deviation (STD) image. (b) Binary image obtained from STD image. (c) Binary image after morphological operation. (d) The rectangle representing the region of interest is overlaid onto the initial glottis image.

After restricting the sequence of HSDI images to the ROI, we obtained a truncated VKG image. This new VKG image eliminates the unwanted background while enclosing the region where vocal fold motions occur. Figure 7 shows the VKG image extracted from a sequence of 60 images within the ROI. With the VKG image restricted to the ROI, we can automatically select the center of the image as the position of the initial contour, thus addressing the initial-contour issue encountered in the contour detection using the level set method. Furthermore, because the background is now minimized, the accuracy of the segmentation is significantly improved.

2.4. Stop rule for level set segmentation
A stop rule is developed for automatic determination of the number of iterations required for the level set segmentation method. This significantly reduces the processing time. As illustrated in Figure 8, when the number of iterations increases (Figure 8 (a–d)), the segmented contour of the VKG image better approaches the actual contour (by inspection). However, the CPU processing time grows nonlinearly with the number of iterations. Here, we implemented an automatic stop rule that halts the algorithm once the resultant contour reaches the desired accuracy. In this way, the computational efficiency can be significantly improved.

Figure 7. Videokymogram (VKG) image extracted from a sequence of 60 high-speed digital imaging images within region of interest (ROI). Left: one image frame in the sequence; the rectangle indicates the ROI, and the mid-line indicates the line position for the VKG. Right: the resultant VKG.
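The ROI construction of Section 2.3 and the ROI-restricted VKG of Figure 7 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 256-bin histogram used for the Otsu threshold, and the synthetic test sequence are all hypothetical. Only Eq. (3), the Otsu threshold, the 12 × 12 opening, the bounding rectangle, and the mid-line VKG are taken from the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def std_image(frames):
    """Eq. (3): per-pixel temporal standard deviation of an (N, H, W) stack."""
    return np.std(frames, axis=0)  # ddof=0, i.e. the 1/N form of Eq. (3)


def otsu_threshold(values, bins=256):
    """Otsu threshold: maximise the between-class variance of the histogram."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                      # pixel count up to each bin
    w1 = w0[-1] - w0                          # pixel count above each bin
    s0 = np.cumsum(hist * centers)
    mu0 = s0 / np.maximum(w0, 1)
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return edges[1:][np.argmax(between)]      # upper edge of the best split bin


def binary_open(mask, k=12):
    """Opening with a k x k square: keep pixels covered by a square that fits in the mask."""
    fits = sliding_window_view(mask, (k, k)).all(axis=(2, 3))
    opened = np.zeros_like(mask)
    for r, c in zip(*np.nonzero(fits)):
        opened[r:r + k, c:c + k] = True
    return opened


def roi_from_frames(frames, k=12):
    """Bounding rectangle (row0, row1, col0, col1) of the moving region.

    Assumes at least one k x k block survives the opening.
    """
    std = std_image(frames)
    mask = binary_open(std > otsu_threshold(std), k)
    rows = np.nonzero(mask.any(axis=1))[0]
    cols = np.nonzero(mask.any(axis=0))[0]
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])


def vkg_in_roi(frames, roi):
    """Stack the ROI mid-line of every frame: one VKG row per frame (time runs down)."""
    r0, r1, c0, c1 = roi
    return frames[:, (r0 + r1) // 2, c0:c1 + 1]


# Synthetic sequence (an assumption): static background, one oscillating
# "glottal" rectangle, and one isolated flickering pixel that the opening removes.
n_frames = 60
osc = 0.4 + 0.3 * np.sin(2 * np.pi * np.arange(n_frames) / 12)
frames = np.full((n_frames, 48, 64), 0.7)
frames[:, 10:38, 24:40] = osc[:, None, None]
frames[:, 2, 2] = osc
roi = roi_from_frames(frames)
vkg = vkg_in_roi(frames, roi)
print(roi, vkg.shape)
```

On real recordings the loop in `binary_open` may be slow; a library routine such as `scipy.ndimage.binary_opening` would be a natural replacement if SciPy is available.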


Figure 8. Level set segmentation on videokymogram images after different number of iterations. (a) 3 iterations, CPU time = 0.0313. (b) 8 iterations, CPU time = 0.1094. (c) 27 iterations, CPU time = 0.3162. (d) 70 iterations, CPU time = 1.6406.

Figure 9 shows the relationship between the area of the detected region and the number of iterations. Once the number of iterations reaches a certain point, the detected area remains almost unchanged, indicating that the iteration should terminate. Accordingly, the change in area of the segmented region is used as the stop rule for the level set segmentation algorithm: the algorithm terminates when the area change becomes less than a threshold (e.g. 0.05%). For the example shown in Figure 8, the algorithm stops after 27 iterations with the implemented stop rule. As shown in Figure 8 (c), the segmented contour after 27 iterations matches the actual contour very closely (by inspection) and is almost the same as the contour obtained after 70 iterations. The stop rule therefore improves the computational efficiency considerably. It is worth mentioning that the problems associated with the selection of the initial curve position, as well as the background interference, may prevent the stop rule from working effectively. However, this issue can be resolved by the ROI restriction.
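The area-based stop rule can be illustrated with a deliberately simplified evolution. In the sketch below, only the 0.05% relative-area threshold comes from the text. The update in `region_step` is a curvature-free, Chan–Vese-style stand-in rather than the authors' level set solver, and the function names, step size `dt`, and synthetic image are assumptions made for the example.

```python
import numpy as np


def region_step(phi, image, dt):
    """One curvature-free, Chan-Vese-style update (simplified stand-in).

    Assumes the current contour has pixels on both sides.
    """
    inside = phi > 0                      # paper's convention: phi > 0 inside
    c1 = image[inside].mean()             # mean intensity inside the contour
    c2 = image[~inside].mean()            # mean intensity outside the contour
    # A pixel is pushed inside when its intensity is closer to c1 than to c2.
    return phi + dt * ((image - c2) ** 2 - (image - c1) ** 2)


def segment_with_stop_rule(image, phi, rel_tol=5e-4, max_iter=500, dt=4.0):
    """Iterate until the relative area change falls below rel_tol (0.05%)."""
    prev_area = np.count_nonzero(phi > 0)
    for n_iter in range(1, max_iter + 1):
        phi = region_step(phi, image, dt)
        area = np.count_nonzero(phi > 0)
        if abs(area - prev_area) < rel_tol * max(prev_area, 1):
            return phi > 0, n_iter        # area has stabilised: stop early
        prev_area = area
    return phi > 0, max_iter


# Synthetic "VKG" (an assumption): dark glottal rectangle on a brighter background.
image = np.full((40, 60), 0.8)
image[5:35, 20:40] = 0.1
rows, cols = np.mgrid[0:40, 0:60]
phi0 = 3.0 - np.hypot(rows - 20, cols - 30)   # signed distance to a small centred circle
mask, n_iter = segment_with_stop_rule(image, phi0)
print(n_iter, int(mask.sum()))
```

The initial contour is a small circle at the image center, which mirrors the automatic initialization that the ROI restriction makes possible; without that restriction, a poorly placed start or a cluttered background could make the area plateau at the wrong region.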

[Figure 9 plot: detected area (vertical axis, 0 to 1000) against number of iterations (horizontal axis, 0 to 150), with marked points (a) 3 iterations, (b) 8 iterations, (c) 27 iterations and (d) 70 iterations.]

Figure 9. Relationship between the area of detected regions and the number of iterations.

Figure 10. (a) Single image frame of the high-speed digital imaging (HSDI) sequence acquired from a patient with left vocal fold paresis. The white line indicates the scanning position for the videokymogram. (b) The videokymogram (VKG) image extracted from HSDI dataset. (c) The resultant VKG image using level set method.
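Once a VKG has been segmented, the bilateral displacements can be read off row by row using the rule stated in the Figure 14 caption: on each line, the lower horizontal position of the contour is taken as the right vocal fold and the higher one as the left vocal fold. The sketch below assumes the segmentation is available as a boolean mask; the function name and the use of NaN for rows with no detected opening are assumptions, as the paper does not say how such rows are handled.

```python
import numpy as np


def fold_displacements(mask):
    """Per-row left/right fold positions from a segmented VKG.

    mask: (T, W) boolean array, one row per frame, True inside the glottal gap.
    Returns (left, right) in pixel units; rows with no glottal pixels are NaN.
    """
    n_rows = mask.shape[0]
    left = np.full(n_rows, np.nan)
    right = np.full(n_rows, np.nan)
    for t in range(n_rows):
        xs = np.nonzero(mask[t])[0]       # x positions inside the contour
        if xs.size:
            right[t] = xs[0]              # lower x -> right vocal fold
            left[t] = xs[-1]              # higher x -> left vocal fold
    return left, right


# Tiny hand-made mask: the gap opens, widens, narrows, then closes.
demo = np.zeros((4, 10), dtype=bool)
demo[0, 4:6] = True
demo[1, 3:7] = True
demo[2, 4:6] = True                       # row 3 stays empty (closed phase)
left, right = fold_displacements(demo)
print(left, right)
```

The values are pixel indices along the scan line; converting them to physical displacement would need a spatial calibration that is not given in the text.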


3. RESULTS OF ANALYSIS
The new approach has been applied to 31 clinical HSDI datasets acquired from 13 normal and 18 pathological subjects. The segmentation strategy was successful in 25 of the 31 datasets, corresponding to a success rate of 80.6%. It is typical to find images in the HSDI recording with high background noise, which in extreme cases severely compromised image quality and caused segmentation failure. The effectiveness of our approach does not depend on whether the VKG is captured from pathological or normal vocal fold vibrations. Furthermore, issues regarding the selection of the initial contour position exist when the level set segmentation method is applied to the constructed VKG (Figure 10 (c)). To address this problem, the ROI is identified using the STD image, as shown in Figure 11 (a). After obtaining the binary image of the STD image (Figure 11 (b)) and applying a subsequent morphological operation (Figure 11 (c)), the boundary of the remaining region of the binary image is defined as the ROI (Figure 11 (d)). The ROI helps to solve the previously mentioned problem by eliminating unwanted background and emphasizing the region of vocal fold motion. The resultant segmented VKG is shown in Figure 12 (c).

Figure 11. (a) Standard deviation (STD) image. (b) Binary image obtained from STD image. (c) Binary image after morphological operation. (d) The rectangle representing the ROI is overlaid onto the initial glottis image.

Figure 12. (a) One frame of the high-speed digital imaging (HSDI) sequence. The rectangle indicates the region of interest (ROI), and the mid-line indicates the line position for the videokymogram (VKG). (b) The VKG image extracted from HSDI sequence within ROI. (c) The resultant VKG, CPU time = 0.6094.

Figure 13 shows two VKG images representing normal voice condition (left) and vocal pathology (right, unilateral vocal fold scar), respectively. After restricting the HSDI image sequence to the ROI, the level set method successfully detected the edge contour of the VKG images. It is observed from the VKG that left–right asymmetry existed in the patient's bilateral vocal fold vibrations, while the vibrations are almost symmetric in the left and right folds in the normal subject.

Figure 13. Videokymogram image after segmentation. Left: normal voice condition, CPU time = 0.4502; Right: vocal pathology, CPU time = 0.3438.

Figure 14 shows the VKG images and the extracted left/right vocal fold displacements obtained from the same patient with right vocal fold paralysis, examined before and after injection laryngoplasty. As shown in the figure, the left–right asymmetry of the vocal fold movement for this patient has shown significant improvement, indicating that the patient recovered successfully following the treatment.

Figure 14. Segmented videokymogram (VKG) and extracted left/right vocal fold displacements from high-speed digital imaging (HSDI) recordings of a patient with unilateral vocal fold paralysis (at right side) pre and post treatment (a and c, respectively). (a) Segmented VKG from the patient before operation (HSDI data recorded on April 13, 2010), CPU time = 0.3438. (b) Bilateral vocal fold displacements extracted from the segmented VKG. The vocal fold displacements at the specific line position represented in the VKG are acquired from the segmented contour of the VKG image. For each line, the lower horizontal (x) displacement of the contour is recorded as the right vocal fold displacement, while the higher horizontal displacement is recorded as the left vocal fold displacement. The red line (upper) and the blue line (lower) represent the left and right vocal fold displacement, respectively. (c) Segmented VKG image from the same patient after operation (HSDI data recorded on February 16, 2011), CPU time = 0.6094. (d) Left/right vocal fold displacements extracted from the VKG image. The red line (upper) and the blue line (lower) represent the left and right vocal fold displacement, respectively.

4. CONCLUSION
Tracing vocal fold motion is a key step in the quantitative analysis of vocal fold kinematics and in the effective assessment of voice disorders. A new approach is introduced that combines the level set method and the vocal fold motion cue to perform image segmentation on VKGs generated from image sequences captured with HSDI. The VKG segmentation is critical, as it allows for quantitative analysis and correlative characterization of vocal fold vibrations with voice pathology. The vocal fold motion cue is utilized prior to applying the level set algorithm in order to identify an ROI for a more effective and efficient segmentation. In particular, the ROI is derived from the STD image, which represents the measure of temporal intensity variation (in STD value) at each pixel location within the laryngeal image frame. The new approach has been applied to multiple clinical HSDI datasets representing both normal and pathological voice conditions and has been shown to be effective in generating improved segmentation outcomes.

REFERENCES
1. Yan Y, Ahmad K, Kunduk M, Bless D. Analysis of vocal fold vibrations from high-speed laryngeal images using Hilbert transform-based methodology. Journal of Voice 2005; 19(2):161–175. doi:10.1016/j.jvoice.2004.04.006.
2. Yan Y, Chen X, Bless D. Automatic tracing of vocal-fold motion from high-speed digital images. IEEE Transactions on Biomedical Engineering 2006; 53(7):1394–1400. doi:10.1109/TBME.2006.873751.
3. Yan Y, Du G, Zhu C, Marriott G. Snake based automatic tracing of vocal-fold motion from high-speed digital images. IEEE International Conference on Acoustics, Speech and Signal Processing 2012; 2012:593–596. doi:10.1109/ICASSP.2012.6287953.
4. Karakozoglou SZ, Henrich N, d'Alessandro C, Stylianou Y. Automatic glottal segmentation using local-based active contours and application to glottovibrography. Speech Communication 2012; 54(5):641–654. doi:10.1016/j.specom.2011.07.010.
5. Manfredi C, Bocchi L, Bianchi S, Migali N, Cantarella G. Objective vocal fold vibration assessment from videokymographic images. Biomedical Signal Processing and Control 2006; 1(2):129–136. doi:10.1016/j.bspc.2006.06.001.
6. Osma-Ruiz V, Godino-Llorente JI, Saenz-Lechon N, Fraile R. Segmentation of the glottal space from laryngeal images using the watershed transform. Computerized Medical Imaging and Graphics 2008; 32(2):193–201. doi:10.1016/j.compmedimag.2007.12.003.
7. Svec JG, Schutte HK. Videokymography: high-speed line scanning of vocal fold vibration. Journal of Voice 1996; 10(2):201–205. doi:10.1016/S0892-1997(96)80047-6.
8. Larsson H, Hertegard S, Lindestad PA, Hammarberg B. Vocal fold vibrations: high-speed imaging, kymography and acoustic analysis. Laryngoscope 2000; 110(12):2117–2122. doi:10.1097/00005537-200012000-00028.
9. Skalski A, Zielinski T, Deliyski D. Analysis of vocal folds movement in high speed videoendoscopy based on level set segmentation and image registration. International Conference on Signals and Electronic Systems 2008:223–226. doi:10.1109/ICSES.2008.4673399.
10. Qin X, Wang S, Wan M. Improving reliability and accuracy of vibration parameters of vocal folds based on high speed video and electroglottography. IEEE Transactions on Biomedical Engineering 2009; 56(6):1744–1754. doi:10.1109/TBME.2009.2015772.
11. Sethian J. Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science. Cambridge University Press: Cambridge, UK, 1999.
12. Osher S, Fedkiw R. Level Set Methods and Dynamic Implicit Surfaces. Springer: New York, 2002.
13. Kichenassamy S, Kumar A, Olver P, Tannenbaum A, Yezzi A. Gradient flows and geometric active contour models. Proceedings of IEEE International Conference on Computer Vision 1995:810–815. doi:10.1109/ICCV.1995.466855.
14. Bertalmio M, Sapiro G, Randall G. Morphing active contours: A geometric approach to topology-independent image segmentation and tracking. Proceedings. 1998 International Conference on Image Processing 1998; 3:318–322. doi:10.1109/ICIP.1998.999021.
15. Chan TF, Vese LA. Active contours without edges. IEEE Transactions on Image Processing 2001; 10(2):266–277. doi:10.1109/83.902291.
16. Otsu N. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics 1979; 9(1):62–66. doi:10.1109/TSMC.1979.4310076.
