Estimation of inferior-superior vocal fold kinematics from high-speed stereo endoscopic data in vivo David E. Sommer Department of Mechanical and Mechatronics Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada

Isao T. Tokuda,a) Department of Mechanical Engineering, Ritsumeikan University, 1-1-1 Nojihigashi, Kusatsu, Shiga 525-8577, Japan

Sean D. Peterson Department of Mechanical and Mechatronics Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada

Ken-Ichi Sakakibara Department of Communication Disorders, Health Sciences University of Hokkaido, 2-5 Ainosato, Hokkaido 002-8072, Japan

Hiroshi Imagawa, Akihito Yamauchi, Takaharu Nito, and Tatsuya Yamasoba Department of Otolaryngology, University of Tokyo Hospital, 7-3-1 Hongo, Tokyo 113-8655, Japan

Niro Tayama Department of Otolaryngology, Head and Neck Surgery, National Center for Global Health and Medicine Hospital, 1-21-1 Toyama, Tokyo 162-8655, Japan

(Received 12 June 2014; revised 8 October 2014; accepted 15 October 2014)

Despite being an indispensable tool for both researchers and clinicians, traditional endoscopic imaging of the human vocal folds is limited in that it cannot capture their inferior-superior motion. A three-dimensional reconstruction technique using high-speed video imaging of the vocal folds in stereo is explored in an effort to estimate the inferior-superior motion of the medial-most edge of the vocal folds under normal muscle activation in vivo. Traditional stereo-matching algorithms from the field of computer vision are considered and modified to suit the specific challenges of the in vivo application. Inferior-superior motion of the medial vocal fold surface of three healthy speakers is reconstructed over one glottal cycle. The inferior-superior amplitude of the mucosal wave is found to be approximately 13 mm for normal modal voice, reducing to approximately 3 mm for strained falsetto voice, with uncertainty estimated at σ ≈ 2 mm and σ ≈ 1 mm, respectively. Sources of error, and their relative effects on the estimation of the inferior-superior motion, are considered, and recommendations are made to improve the technique. © 2014 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4900572]

PACS number(s): 43.70.Jt, 43.70.Gr, 43.70.Aj [BHS]

a) Author to whom correspondence should be addressed. Electronic mail: [email protected]

J. Acoust. Soc. Am. 136 (6), December 2014, Pages: 3290–3300

I. INTRODUCTION

High-speed endoscopic video imaging of the human vocal folds has helped elucidate and improve our understanding of the complex vocal fold motion that occurs during speech.1–4 Endoscopic video offers the advantage of being a relatively non-invasive technique for observing the vocal folds in vivo, while still allowing for relatively normal muscle activation. Traditional methods of endoscopic imaging have been limited in that they capture only a two-dimensional (2-D) projection, in the transverse plane, of the vocal fold motion. The vocal fold kinematics, however, suggest the importance of three-dimensional (3-D) dynamics with substantial inferior-superior motion. For instance, eigenmode analysis of continuum models of the vocal folds reveals a rich normal mode spectrum in which, in addition to the lateral movement, the inferior-superior motion contributes significantly to support the self-sustained vibrations.5 In the singing voice, the addition of the inferior-superior degree of freedom aids in stabilizing the vocalization over a wide frequency range.6 For the purposes of diagnosing voice pathologies, the addition of out-of-plane endoscope information, that is, motion orthogonal to the imaging plane, would provide clinicians with an enhanced ability to recognize vocal fold vibration regimes typically associated with known pathologies.

Several attempts have been made toward measuring the 3-D kinematics of the vocal folds. The medial surface has been observed in excised larynges,7 as well as in in vivo canine larynges.8 These approaches, which require suturing small markers to the vocal folds, are not applicable to human subjects, however. As a non-invasive alternative, bulk behavior, such as body-layer movement of the vocal folds, has been observed by high-frame-rate ultrasound9 and stroboscopic x-ray laminagraphy.10 These techniques are, however, not



sensitive enough to capture detailed movements of the vocal fold medial edge. Laser systems, on the other hand, have been applied in conjunction with a high-speed video camera to quantitatively measure the vocal fold motion. For laser systems focused on a single point,11,12 two points,13 or a line,14,15 proper positioning of the limited target points on the vocal folds is paramount, as the inferior-superior behavior depends upon the measurement location on the vocal folds. This difficulty is resolved by a system of laser dots uniformly distributed over a planar area.16 The laser grid system, however, requires a complicated calibration procedure for every setting of the laser and camera and has yet to be applied in vivo. Furthermore, for the purposes of capturing detailed surface kinematics, such as tracking the moving location of the medial edge, laser systems provide only point-wise information and cannot track the medial edge when it falls between laser dots. Optical coherence tomography (OCT) has also been used to reconstruct in vivo vocal fold motion at a point.17 The advantage of OCT methods is that they can penetrate approximately 2 mm into the vocal fold tissue and show the motion of both the epithelium and the underlying tissue. Triggered OCT has been successful in producing a temporally varying 3-D reconstruction of the vocal fold tissue in excised calf larynges;18 however, the 3-D reconstruction necessitates scanning the tissue over many periodic cycles, and significant artifacts are introduced if the signal deviates from perfect periodicity.17

As an alternative approach for non-invasive 3-D measurement, this paper focuses on a stereo-endoscopic observation of the vocal folds.19–21 Stereo-endoscopy provides two views of the larynx from different angles, enabling 3-D viewing of the laryngeal structure. Stereo-matching techniques, widely developed in the field of computer vision, can be used to automatically recover the 3-D kinematics of the vocal folds in vivo.
The endoscope has outer dimensions that allow for unobstructed views of the larynx through the mouth, while still allowing subjects to produce relatively natural vowel sounds and to abduct and adduct the vocal folds as normal. Alignment and calibration do not need to be redone for each patient, allowing for easy adaptation to clinical use. Without any special training, surgeons can handle the stereo-endoscope in the same way as a standard endoscope. In contrast to our preliminary result,22 which applied a primitive version of the stereo-matching technique to a single data set without any consideration of the estimation error, this paper presents a comprehensive study with an improved algorithm for reconstructing the 3-D kinematics, a thorough examination of the reconstruction error, and validation using several data sets recorded from different subjects. The medial edge of the vocal folds, contrasted against the dark open glottis, provides one of the few high-contrast features needed for robust stereo-matching. While the methods mentioned previously measure the kinematics of a given region of the vocal fold surface, the technique presented herein seeks to measure the kinematics of the tissue that forms the medial-most edge of the vocal folds at a given time in the vibratory cycle. Observing the motion of the medial surface of the vocal folds during speech is of

particular interest as it is the surface that interacts with the airflow from the lungs. The inferior-superior motion of the medial edge, indicating the change from a converging to a diverging glottis, has been shown to be vital to the self-sustained oscillation of the vocal folds.23

II. METHODS

A. High-speed stereo-endoscope

A stereo-endoscope camera, manufactured by Nagashima Medical Instrument Corporation, is used to reconstruct the 3-D structure of in vivo vocal fold motion. By providing two slightly different simultaneous views of the vocal folds, the out-of-plane distances can be estimated from the high-speed stereo-endoscopic imagery. A schematic view of the endoscope head showing the orientation and field of view, as well as the defined coordinate system used for 3-D reconstruction, is shown in Fig. 1. The stereo-endoscope includes two independent ordinary rigid optical systems with diameters of 4 mm, a fiber-optic light guide, an optical connector, a light source, and a camera. The tips of the optical systems house objective lenses with prisms designed for a 70° oblique-angled view, with a field angle of 40°. The distance between the optical axes of the tips is 6 mm, and the outer dimensions of the endoscope head are 21 mm wide by 14 mm in height. The stereo-endoscope is attached to a lens with a 50 mm focal length. A rectangular coordinate system is defined with the origin at the tip of the left endoscope. The Z axis is along the optical axis of the left endoscope. The X axis passes through the two endoscope tips, and the Y axis is defined to form a right-handed coordinate system.

B. Input data used in the present study

Our subject group was composed of three healthy male speakers with no evidence of laryngeal pathology, as determined by the otolaryngology team at the University of Tokyo Hospital. Their ages ranged from 22 to 36 years. During the study, the subjects were asked to produce a normal phonation of the vowel /a/. Subjects 1 and 2 produced a chest voice, whereas subject 3 used a falsetto-like voice. Subjects 1 and 2 received no anesthetic, while subject 3 received a peroral spray of 4% lidocaine hydrochloride to help suppress the gag reflex.

FIG. 1. Schematic of the stereo endoscope, including the coordinate system employed in the 3-D reconstruction.

A Photron Fastcam 1024PCI camera was used for high-speed digital video recordings. The recordings were carried out with an image resolution of 768 × 352 pixels, a frame rate of 2500 fps, and a maximum recording duration of 2.53 s. An acoustic signal from a microphone and a glottal signal from an electroglottograph (EGG) were recorded simultaneously with the video data. A Hamming-window analysis was used to determine the fundamental frequency of the EGG and microphone data for each subject. Fundamental frequencies of 175, 178, and 326 Hz were observed for subjects 1–3, respectively. The glottal open quotient, usually defined as O_q = T_o/T, where T is the period and T_o is the duration of the glottal open phase, is estimated from the image data.24 O_q is estimated for each subject as O_q ≈ n_o/n, where n_o is the number of image frames showing a visibly open glottis and n is the total number of images in one cycle. Open quotient estimates of 0.267, 0.571, and 0.875 were obtained for subjects 1–3, respectively. The higher O_q and fundamental frequency obtained for subject 3 imply a characteristic of falsetto-like phonation, whereas the low O_q estimates obtained for subjects 1 and 2 are typical of modal voice.21
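The frame-count estimates above can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the function names and the boolean open/closed frame representation are assumptions.

```python
# Illustrative estimates of fundamental frequency and open quotient from
# video frame counts, following the definitions in Sec. II B.

def fundamental_frequency(frames_per_cycle, fps=2500):
    """f0 implied by the number of video frames spanning one glottal cycle."""
    return fps / frames_per_cycle

def open_quotient(open_flags):
    """O_q ~= n_o / n from per-frame flags (True = glottis visibly open)."""
    return sum(open_flags) / len(open_flags)
```

For example, a cycle spanning 14 frames at 2500 fps implies f0 ≈ 179 Hz, close to the modal-voice subjects above.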

III. RECONSTRUCTION OF VOCAL FOLD MOTION FROM HIGH SPEED CAMERA DATA

Extracting 3-D point data from a pair of stereo images relies on a pattern-matching algorithm to locate the relative positions of image features between the left and right frames, a process referred to as stereo-matching.25,26 The pixel coordinates of the same feature in the left and right images are sought; used in conjunction with known geometric and optical quantities, they yield the disparity information and thus an estimate of the position of the feature in 3-D space. A high-level overview of the stereo-matching procedure, as implemented in the present study, is given in Fig. 2.

For each frame from the high-speed camera, left and right sub-images are extracted and pre-processed with a Gaussian blur to mitigate noise contamination, as well as histogram-balanced to improve contrast and reduce lighting differences between the image pair. The pre-processed image pair becomes the input for the stereo-matching algorithm, which is described in more detail in the following sections. In brief, the algorithm traverses the left image, adaptively locating and sizing candidate match windows (Sec. III B 1), and seeks the corresponding location in the right image that minimizes the difference of the luminosity intensities contained in the left and right windows (Sec. III B). A 3-D point cloud is generated from the disparity data (Sec. III A) of all matched features in the left and right images, and these data are projected onto a 2-D plane (Sec. III C) to estimate the vertical motion of the vocal fold edges through a phonatory cycle. A phase-averaging process is introduced as a means of improving the reconstruction resolution and accuracy (Sec. III B 2).

A. Geometry and problem formulation

A typical stereo endoscopic image from the high-speed camera system is shown in Fig. 3. A global pixel-based coordinate system (x*, y*) is defined with the origin located at the top left of the captured image. Herein the superscript * denotes variables referenced to the global image pixel-based coordinate system. From each image, two sub-regions of equal dimensions are extracted heuristically from the global image frame to capture the regions containing the vocal folds; these sub-regions are labeled as the left (L) and right (R) frames. The points (x*_L0, y*_L0) and (x*_R0, y*_R0) denote the origins of the left and right frames, respectively. The left and right frames are each assigned a local pixel-based coordinate system, (x_L, y_L) and (x_R, y_R), respectively. The primary quantities of interest for 3-D reconstruction are the vertical distance, D_V, and the horizontal distances, D_L and D_R, of a feature in the left and right frames as measured from the corresponding center of the optical field, denoted (x*_LC, y*_LC) and (x*_RC, y*_RC) in the global frame and depicted in Fig. 3. The center of the optical field is found through calibration and remains constant for a given optical set-up. Given the pixel coordinates of a feature in the left and right frames, the quantities D_L, D_R, and D_V are computed as

D_L = x*_L0 + x_L − x*_LC,  (1a)

D_R = x*_R0 + x_R − x*_RC,  (1b)

D_V = y*_LC − 0.5(y*_L0 + y_L + y*_R0 + y_R).  (1c)

FIG. 2. Flowchart giving a high-level overview of the stereo-matching procedure.

FIG. 3. (Color online) Definition of pixel-based coordinates for image extraction and stereo reconstruction. Dashed sub-regions demarcate the extracted left and right images. The cross indicates the position of a feature contained within both the left and right sub-images.

The left-right disparity is defined as D = D_L − D_R. The Cartesian coordinates of the feature point with respect to the endoscope-based coordinate system (see Fig. 1) are then estimated, assuming no optical distortion, as

Z = 1/(k_1 D + k_2),  (2a)

X = k_3 Z D_L + k_4,  (2b)

Y = k_5 Z D_V + k_6,  (2c)

where the constants k_1 through k_6 are calibrated for a given optical set-up.21,26 To account for optical distortion in the lens and camera systems, an expanded version of Eq. (2a) with additional constants and higher-order polynomial terms is used in the present work, given as Eq. (3),

Z = 1/(k_1 D + c_1 D_L² + c_2 D_L + c_3 D_V² + c_4 D_V + k_2).  (3)
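For concreteness, Eqs. (2b), (2c), and (3) can be transcribed directly into code. The sketch below is illustrative only; the function name, argument order, and tuple packing of the calibration constants are assumptions, not taken from the paper.

```python
def reconstruct_point(dl, dr, dv, k, c):
    """Map pixel distances (D_L, D_R, D_V) to endoscope coordinates (X, Y, Z)
    via Eqs. (2b), (2c), and the distortion-corrected depth of Eq. (3).
    k = (k1, ..., k6) and c = (c1, ..., c4) are calibration constants."""
    k1, k2, k3, k4, k5, k6 = k
    c1, c2, c3, c4 = c
    d = dl - dr  # left-right disparity, D = D_L - D_R
    z = 1.0 / (k1 * d + c1 * dl**2 + c2 * dl + c3 * dv**2 + c4 * dv + k2)
    x = k3 * z * dl + k4
    y = k5 * z * dv + k6
    return x, y, z
```

With all distortion constants c_i set to zero this reduces to the undistorted model of Eqs. (2a)-(2c).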

The calibration constants k_i and c_i are optimized for the current camera setup as follows: images of a 5 mm × 5 mm square grid are captured at distances ranging from Z = 22–92 mm. At each distance, the X, Y, Z coordinates of the grid points with known dimensions D_L, D_R, and D_V are obtained. The calibration constants are then determined by the least-squares method to best fit these data. The error and uncertainty of this calibration procedure are discussed further in Sec. IV.

B. Adaptive window sum of absolute differences stereo-matching

Traditional stereo-matching techniques, pioneered in the computer vision and robotics fields, have been shown to perform well at reconstructing 3-D scenes containing buildings, rooms, landscapes, and other commonly encountered objects.27 The aforementioned scenes contain many edges and high-contrast regions that allow the stereo-matching algorithm to identify corresponding features with considerable accuracy and precision. Laryngeal endoscopic images, conversely, contain few hard edges or high-contrast regions for a stereo-matching algorithm to use as salient features. Given that the glottis is a smooth, continuous, largely monochromatic tissue surface sometimes hidden by reflective mucus, it is a poor subject for traditional stereo-matching algorithms. Furthermore, the inaccessibility of the glottis severely limits viewing angle and lighting options. One of the simplest stereo-matching algorithms, which is the foundation of the present work, is the sum of absolute differences (SAD) algorithm.25,26,28 The SAD algorithm seeks to minimize the matching error between a given location of a feature in the left image and the corresponding feature at some location in the right image,

E(x_R, y_R) = Σ_{i=−m}^{m} Σ_{j=−m}^{m} |L(x_L + i, y_L + j) − R(x_R + i, y_R + j)|,  (4)

which computes the sum of absolute differences in each pixel over a window of size (2m + 1) × (2m + 1), centered at (x_L, y_L) in the left image and (x_R, y_R) in the right image.28 The function L(x, y) returns the luminosity value of the pixel located at (x, y) in the left image, and R(x, y) does so analogously for the right image. Determining the window size m is a trade-off between increasing the window size to include more pixels, which increases the uniqueness of the minimum of E(x_R, y_R), and decreasing the window size to offer finer resolution of scene depth. Smaller windows increase the algorithm's sensitivity to differences between the left and right images due to noise, which is of particular concern in regions of constant texture.28 The lack of high-contrast features erodes much of the advantage gained by more complex algorithms. Thus a relatively simple SAD algorithm with some modifications is considered herein.
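The SAD criterion of Eq. (4) amounts to a scan for the window that minimizes the summed absolute luminosity difference. The following is a minimal sketch, not the authors' implementation; the function name, search bounds, and brute-force scan are illustrative choices.

```python
import numpy as np

def sad_match(left, right, xl, yl, m, search):
    """Find the (x_R, y_R) in `right` minimising the SAD error of Eq. (4)
    for the (2m+1) x (2m+1) window centred at (xl, yl) in `left`.
    The vertical scan is restricted to yl +/- `search` rows."""
    template = left[yl - m:yl + m + 1, xl - m:xl + m + 1].astype(float)
    best, best_err = None, np.inf
    for yr in range(max(m, yl - search), min(right.shape[0] - m, yl + search + 1)):
        for xr in range(m, right.shape[1] - m):
            window = right[yr - m:yr + m + 1, xr - m:xr + m + 1].astype(float)
            err = np.abs(template - window).sum()  # E(x_R, y_R) of Eq. (4)
            if err < best_err:
                best, best_err = (xr, yr), err
    return best, best_err
```

In the adaptive scheme of Sec. III B 1, the half-width m would additionally grow until the in-window luminosity standard deviation exceeds the threshold.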

1. Adaptive window sizing and positioning

In an effort to improve the performance of the stereo-matching algorithm and to reduce false matches, an adaptive window algorithm is implemented. The process of adaptive window positioning and sizing is outlined in Fig. 4. The traditional SAD stereo-matching algorithm searches for matches either at every pixel or on some predefined grid.27 The endoscopic images contain large featureless regions; attempting to match non-identifiable and non-unique windows within these regions leads to many false and erroneous matches.28,29 The algorithm implemented

herein thus biases match point selection toward image regions with higher intensity variance. Given an identified feature, an initial default window size m_0 is defined to form a window of dimensions (2m_0 + 1) × (2m_0 + 1) centered about a given match point (x_L, y_L) in the left image. The standard deviation of the luminosity of the pixels contained in the window is calculated, after which the window center is shifted within a set search distance. The location within the search region that maximizes the luminosity standard deviation is used as the match point. Adding this degree of flexibility in match point selection discourages the seeding of match points in regions of very low luminosity variation and allows windows to "grab" nearby features if they offer improved uniqueness. At the optimal position within the search radius, the standard deviation of the luminosity of the pixels contained in the default-sized window is calculated and compared to a predefined threshold. If the standard deviation is below the threshold, that is, σ < σ_thresh, the window is enlarged by an increment g in each dimension (e.g., m = m_0 + g), and the standard deviation is recalculated for the resulting enlarged window until the threshold is reached. The value of g is set to 1 pixel by default but can be increased to save computation time. The threshold value is set heuristically to control window growth in low-texture regions; because all images are histogram-balanced before executing the stereo-matching, thresholds between σ_thresh = 20 and 25 were found to give good results for all of the images evaluated in this study. Figure 5(a) shows an extracted left image of the vocal folds, with the corresponding standard deviation of luminosity calculated with a 5 × 5 pixel window presented in Fig. 5(b). Executing a very sparse implementation of the feature-matching algorithm described herein yields the window sizes and distribution shown in Fig. 5(c); it should be noted that the density of match points is much higher for reconstruction but is rarefied in the figure for demonstration purposes.

After a window position and size are determined for an initial trial point, the next trial point is selected by moving in the y direction by a distance proportional to the current window size. Calculating the distance between trial points as a function of the current window size ensures that featureless low-texture regions are treated sparsely while strong features, like the vocal fold edge, are reconstructed at much higher resolution.

FIG. 4. Flowchart outlining the procedure of adaptive window sizing and positioning.

2. Phase averaging to improve reconstruction

With the relatively low resolution and noisy high-speed imagery, uncertainty in the camera calibration, and stereo-matching inhibited by a lack of features, the reconstructed 3-D data points contain a degree of estimation error. In an effort to reduce the effect of these errors, a phase-averaging algorithm based on the idea of stroboscopy30,31 is implemented to take multiple samples of the scene at equivalent phase instances. The frame rate of the camera is assumed to be sufficiently high that changes due to any large-scale camera or glottal movements can be neglected over short recording durations. This appears to be a reasonable assumption, as camera movement was shown to remain on the order of one pixel for up to 100 consecutive frames, or 40 ms at the frame rate used.

FIG. 5. (Color online) Adaptive window SAD algorithm. (a) A typical extracted left image frame with open glottis, (b) the standard deviation of pixel luminosity calculated with a 5 × 5 interrogation window, and (c) the resulting window sizes and placements for a sparse implementation with m_0 = 2 and σ_thresh = 21.

A global timing signal that captures the fundamental frequency of the vocal fold oscillations is required to determine the phase angle of each image in a sequence. For simplicity, the videokymographic technique32 is used to extract the glottal width as the timing signal. Namely, a row of pixels is heuristically selected from a left image frame such that it crosses a region of the glottis, is free of mucus or lighting aberrations, and participates in the bulk oscillation of the vocal folds. The points of maximum and minimum rate of change of pixel luminosity across the row are taken to be the extremities of the glottal width. Figure 6 shows the results of the glottal width estimation algorithm for the data of the three subjects in the present study. The first image with zero glottal width marks the beginning of each cycle; the index number of the image in which the glottal width crosses the zero line for the kth time is denoted h_k. The period of each cycle is estimated by T_k = (h_{k+1} − h_k)/FPS, where FPS is the frame rate of the high-speed camera. Because the frame rate is relatively low compared to the frequency of oscillation, the imagery-based estimate of the period varies from cycle to cycle. To improve accuracy, the average period from all recorded cycles, T_avg, is used. Basing the phase angle on the average period, rather than strictly on the glottal width estimate of each image, reduces errors due to lighting variation, image noise, and the relatively low frame rate compared to the vibration of the vocal folds. Comparing the average period estimate obtained from the glottal width method to those obtained from the EGG and microphone shows very good agreement between the three estimates, suggesting that the glottal width method provides a reasonable timing signal to coordinate the phase-averaging algorithm.

FIG. 6. (Color online) Timing signal for the phase-averaging algorithm. Estimate of the glottal width (in pixels) of each image in sequence for (a) subject 1, with complete glottal closure and strong collision; (b) subject 2, with good closure; and (c) subject 3, with weak glottal closure and higher-frequency phonation. Circles indicate the point selected as the initial closing and set to zero phase.

The three subjects chosen in this study exhibited strongly periodic and regular vocal fold vibration. For subjects who exhibit irregular vibrations, this assumption of simple periodicity would not hold. A higher-order estimate of the period would suffice if the signal still exhibits periodic behavior; for aperiodic vibrations, the reconstruction would not benefit from phase averaging.

With the period and the time between sequential images known, a phase angle φ_i is assigned to each image in the sequence, such that φ_i = 360i/(FPS × T_avg) mod 360. The image sequence is then sorted into N uniform bins of prescribed width Δφ. All images in each bin are of similar phase angle and are analyzed by the stereo-matching algorithm. The results of the stereo-matching algorithm are aggregated for all images within a bin; thus a bin is effectively treated as a single image but with higher matching density. This has the further advantage of averaging out noise, because the phase averaging yields multiple samples of essentially the same scene with distinct random image noise. From a time series of consecutive images, the phase-averaging algorithm outputs N sets of 3-D point cloud data, the analysis and interpretation of which are discussed in the following sections.
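The phase-angle assignment and binning step can be sketched as follows. This is a hedged illustration; `phase_bins` and its return convention are assumptions, not the authors' code.

```python
import numpy as np

def phase_bins(n_images, fps, t_avg, n_bins=36):
    """Assign each image index i a phase angle phi_i = 360*i/(fps*t_avg) mod 360
    and sort the indices into n_bins uniform phase bins (Sec. III B 2)."""
    i = np.arange(n_images)
    phi = (360.0 * i / (fps * t_avg)) % 360.0
    bin_width = 360.0 / n_bins  # the prescribed bin width, delta-phi
    bin_idx = (phi / bin_width).astype(int)
    bins = [np.flatnonzero(bin_idx == b) for b in range(n_bins)]
    return phi, bins
```

All images in one bin are then stereo-matched together, and their point clouds aggregated.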

C. Obtaining vertical movement of the vocal fold edges

The primary quantity of interest is the inferior-superior motion of the medial surface of the vocal folds through the phonatory cycle.33,34 To obtain the vertical component of the motion of the medial edge of both vocal folds, coronal sectional projections of the 3-D reconstruction data onto the XZ plane are calculated. All data points contained in the region Y_min < Y_i < Y_max are projected onto the XZ plane, where Y_min and Y_max define the planes that mark the extents of the section volume. The section volume is heuristically determined with the intent of obtaining a representative sample of the vocal fold motion in a region that is well reconstructed and free of mucus or aberrations. An XY projection of all points from a typical 3-D reconstruction during the open phase of the phonatory cycle is shown in Fig. 7(a), with the values of Y_min and Y_max used to extract the XZ projection; the XZ projection of all points contained in the section volume is shown in Fig. 7(b). As mentioned previously, the vertical location of the medial edges of both vocal folds is of interest; from the XZ

FIG. 7. (Color online) Cross-sectioning and manual vocal fold edge selection procedure. (a) Projection of all reconstructed points onto the XY plane, overlaid approximately on one image from the bin for demonstration; Y and X lines are shown to indicate the extents of the section volume. (b) Projection of the points contained in a section volume onto the XZ plane, showing the regions selected as the left and right vocal fold edges and the average values of the points that fall within the left and right regions, Z_L and Z_R. Anatomical directions are indicated by the cross in the lower left of each figure.


projection of the data contained in the section volume, the ranges of the regions of points likely to represent the vocal fold edges are manually selected. The vertical Z location of the left vocal fold medial edge is then taken to be the mean of all of the projected points such that X_L,min < X_i < X_L,max, where X_L,min and X_L,max define the extremities of the region believed to contain the left vocal fold edge; X_R,min < X_i < X_R,max applies analogously for the right vocal fold edge. The final output of the reconstruction algorithm is thus the average Z location of the left and right vocal fold edges, Z_L and Z_R. Bounds of the regions containing the projections of all points to be averaged for the left and right vocal fold Z positions, as well as the calculated mean values, are indicated in Fig. 7.

To track the dynamics of the vocal fold edges throughout the phonatory cycle, the aforementioned projection and selection procedure is repeated for each of the N bins of phase-averaged reconstructed data. The results of a typical output of the algorithm for a Y location near the middle of the length of the vocal folds are presented in Fig. 8; these results are for a given set of 100 images with Δφ = 10°, yielding 36 sampled values each for Z_L(φ) and Z_R(φ) across an averaged period of oscillation. As one would expect, the temporal reconstruction of the glottis is much smoother through phase angles where the glottis is open and the vocal fold edge is contrasted against the dark subglottal region. When the glottis is closed, the reconstruction estimates the superior-most point on the plane of contact between the vocal folds. The crease created between the contacting left and right medial surfaces as they collide does not provide as high a contrast or as unique a feature as the medial edge does in the open glottis.
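For one bin, the sectioning, projection, and edge-averaging procedure reduces to a masked mean. The following is a minimal sketch under assumed array conventions; the function name and the (N, 3) point layout are illustrative.

```python
import numpy as np

def edge_height(points, y_min, y_max, x_min, x_max):
    """Project the 3-D points inside a coronal section volume onto the XZ
    plane and return the mean Z of those whose X falls inside the manually
    selected edge region (Sec. III C). `points` is an (N, 3) array of
    (X, Y, Z) coordinates; the Y and X bounds are user selections."""
    in_section = (points[:, 1] > y_min) & (points[:, 1] < y_max)
    xz = points[in_section][:, [0, 2]]            # projection onto XZ plane
    in_edge = (xz[:, 0] > x_min) & (xz[:, 0] < x_max)
    return xz[in_edge, 1].mean()                  # average Z of edge points
```

Calling this once per phase bin, with separate X bounds for the left and right folds, yields the Z_L(φ) and Z_R(φ) traces described above.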

IV. MODELING RECONSTRUCTION UNCERTAINTY

The reconstruction gives an estimate of the inferior-superior position of the outer surface of the vocal folds throughout the glottal cycle. Three distinct sources of error have been identified and are considered in the following sections: (a) the heuristically selected section and projection parameters Y_min, Y_max, X_L,max, X_R,max, X_L,min, and X_R,min; (b) the error inherent in the optical and imaging apparatus' ability to obtain distance data from feature disparities; and (c) uncertainty in the stereo-matching algorithm's estimate of disparity for each matching window due to noise and poor image quality.

A. Sensitivity to choice of heuristic parameters

Reconstruction data take the form of 3-D point clouds rather than a smooth continuous surface of the vocal folds. Thus it is necessary to collect and average all data points contained in a finitely thick volume projected onto the XZ plane to achieve a reliable estimate of the 3-D vocal fold edge location. The section volume and its location are manually selected based upon visual identification of clear and well-reconstructed regions. This implies that the proposed algorithm inherently depends upon the manual selection of the measurement zone, leading to variability in the estimation results. To evaluate the magnitude of this variation, the same reconstruction data are sectioned and projected for eight different selections of the section volume, and the X bounds of the vocal fold edge regions are reselected for each case. The results for the vertical (inferior-superior) motion of the left fold edge for the various section volumes and edge locations are given in Fig. 9. Each position estimate gives the mean value of the points within the region selected as containing the left vocal fold edge for each XZ projection. Although the absolute value of the vertical position is indeed affected by the manual selection of the section volume and vocal fold edge locations, the global behavior of the vocal fold motion throughout a glottal cycle is retained. The range of inferior-superior position estimates obtained when the vocal fold edge is visible is approximately half that obtained during periods of glottal closure.

B. Estimating disparity error and noise sensitivity

As introduced in Sec. III, the Z position of a feature on the vocal folds is estimated by Eq. (3). Assuming that the disparity estimation and the calibration errors are uncorrelated,35 the variance of the measurement σ_Z² can be expanded as a function of the input quantities as

σ_Z² = (∂Z/∂D)² σ_D² + Σ_{i=1}^{2} (∂Z/∂k_i)² σ_{k,i}² + Σ_{i=1}^{4} (∂Z/∂c_i)² σ_{c,i}²,  (5)

FIG. 8. Typical reconstruction of left and right vocal fold edge Z position for one phonatory cycle for normal speech.

J. Acoust. Soc. Am., Vol. 136, No. 6, December 2014

where σ_D² is the variance of the disparity estimates, and σ_{k,i}² and σ_{c,i}² represent the variances of the calibration constants. Casting the measurement uncertainty in this form is demonstrative, albeit limited in utility, as the quantities σ_{k,i}² and σ_{c,i}² are not explicitly known from the calibration method implemented. Given that the calibration coefficients c_i and k_i are not directly measured quantities but rather optimized in the least-squares sense from clean calibration test data, the concept of their uncertainty is less clear. The calibration error is thus treated separately in Sec. IV C. Removing the calibration error terms yields a useful expression for considering the impact of the disparity uncertainty on the estimation of the Z position.
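The first-order propagation used in Eq. (5) can be sketched generically with finite-difference sensitivity coefficients; the function below is an illustrative aid, not part of the calibration code:

```python
import numpy as np

def propagate_variance(f, x0, variances, eps=1e-6):
    """First-order (GUM-style) variance propagation:
    var_f ≈ sum_i (∂f/∂x_i)² var_i, with the partial derivatives
    approximated by central finite differences about the nominal x0."""
    x0 = np.asarray(x0, dtype=float)
    var_f = 0.0
    for i, var_i in enumerate(variances):
        dx = np.zeros_like(x0)
        dx[i] = eps
        dfdx = (f(x0 + dx) - f(x0 - dx)) / (2.0 * eps)  # ∂f/∂x_i
        var_f += dfdx**2 * var_i
    return var_f
```

For a linear model f(x) = 2x₀ + 3x₁ with unit input variances this returns 2² + 3² = 13, matching the analytic result.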

FIG. 9. (Color online) Vertical motion of left vocal fold edge for various section volumes and manual selections of vocal fold edge location. Each point gives the estimate of the vertical location of the left vocal fold edge based on a given user input for that phase angle. The dashed line gives the average of all eight attempts.

σ_Z ≈ |∂Z/∂D| σ_D.  (6)

Substituting D into Eq. (3), the sensitivity coefficient for disparity, ∂Z/∂D, is approximated, neglecting small terms, as

∂Z/∂D ≈ k1 / (k1² D² + 2 k1 k2 D + k2²).  (7)

This shows that the reconstruction accuracy is sensitive to errors in the disparity estimate, and that the level of sensitivity is a function of the disparity itself. This motivates the analysis and quantification of the performance of the stereo-matching algorithm, which determines the behavior of the quantity σ_D². As previously mentioned, the glottis is a rather poor candidate for stereo-matching, and the matching is therefore especially sensitive to image noise. To examine the performance of the matching algorithm for varying levels of image noise, an idealized synthetic image with known disparity is sought as a baseline. A high-resolution consumer digital single-lens reflex (DSLR) camera is attached to the laryngoscope in place of the high-speed camera. The DSLR (Nikon D600) has significantly improved dynamic range, resolution, and noise performance compared to the high-speed camera used in this study. An idealized image is created by capturing an image of the vocal folds with the DSLR, implementing a noise removal algorithm, converting to a gray-scale image, and then downsampling the high-resolution image to match the resolution and size of the high-speed images. This procedure yields an idealized image with virtually no noise compared to the high-speed images and thus offers a good baseline for comparison. The left sub-image from the idealized image is used as both the left and right images such that perfect performance of the stereo-matching algorithm would yield zero disparity at all test points. Zero-mean Gaussian random pixel noise is then added to the left and right images, forming a left-right image pair that consists of the same base image but with independently sampled random noise. The stereo-matching algorithm, with the same parameters used in the reconstructions, is implemented on the image pair for each noise level. Given that each image in a pair

varies only in its random noise, any deviation from zero disparity is attributable to the noise. The performance of the stereo-matching algorithm for increasing image noise levels is given in Fig. 10. With an increase in the image noise, a linear increase in the standard deviation of the disparity estimate is observed. The results of Fig. 10 also suggest that estimation of the y disparity is more than twice as sensitive to random noise as the x disparity estimate; this might be due to the inherent shape of the vocal folds. The long continuous vocal fold edge varies little under shifts in the y direction but offers a sharp feature in the x direction.

FIG. 10. (Color online) Standard deviation of the error in estimating the disparity between features in the left and right synthesized calibration images as a function of increasing random pixel noise for (a) horizontal disparity estimate and (b) vertical disparity estimate. The effect of noise mitigation with a Gaussian blur of strength σ_b is shown.

Analysis of the high-speed camera images suggests the image sensor noise is approximately equivalent to random additive pixel noise of 1%–2%, which yields uncertainties of the disparity estimates of approximately σ_D,x ≈ 0.5 pixels and σ_D,y ≈ 1 pixel. Real images are likely to produce greater uncertainty in disparity estimates, as occlusion effects, changes in lighting, and cropped regions will vary between the two images, giving invalid disparity estimates and consequently increased errors in the measurement. Using the estimate for the standard deviation of disparity measurements garnered from the synthetic images, an approximation of the standard deviation of the Z reconstruction, σ_Z, can be estimated for each point. The average standard deviation of all reconstruction points is σ_Z ≈ 2.1, 2.0, and 0.74 mm for subjects 1–3, respectively. A singularity exists in Eq. (7) at D = −k2/k1, which corresponds to an object infinitely far from the endoscope head. Reconstruction thus becomes more sensitive to noise for objects that are further from the endoscope head. The reconstruction points for subject 3, with the vocal folds closer to the endoscope head than for the other two subjects, are less sensitive to errors in disparity estimation.
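The interplay of Eqs. (6) and (7) can be illustrated numerically; the calibration constants below are invented for illustration only and are not the values fitted in this study:

```python
# Illustrative evaluation of Eqs. (6) and (7): sigma_Z = |dZ/dD| * sigma_D.
# k1 and k2 are hypothetical calibration constants, not the fitted values.
k1, k2 = 0.05, 1.0
sigma_D = 0.5  # disparity uncertainty in pixels, from the synthetic-noise test

def dZ_dD(D):
    """Sensitivity coefficient of Eq. (7); diverges as D approaches -k2/k1."""
    return k1 / (k1**2 * D**2 + 2.0 * k1 * k2 * D + k2**2)

for D in (20.0, 10.0, 5.0, 0.0, -10.0):
    sigma_Z = abs(dZ_dD(D)) * sigma_D  # Eq. (6)
    print(f"D = {D:6.1f} px  ->  sigma_Z = {sigma_Z:.4f}")
```

With these constants the singularity sits at D = −k2/k1 = −20 pixels, and σ_Z grows without bound as the disparity approaches it, consistent with the greater noise sensitivity observed for objects farther from the endoscope head.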

C. Estimating Z-reconstruction error from calibration

Equation (5) demonstrates how error in the disparity estimation propagates through to the final Z estimate. The equation, however, does not directly take into account the sensitivity of the measurement to errors in the calibration coefficients. Instead, the residual error for the calibration data is evaluated. As detailed in Sec. II A, the 3-D grid data on the calibration grid are fitted to reconstruct the known grid positions and distance to the laryngoscope probe. The Z error is evaluated for imaging distances ranging from 62 to 92 mm at 5 mm intervals. At each imaging distance, the reconstruction accuracy is evaluated for several points across the frame, leading to 629 samples total. The error is approximately normally distributed, with mean μ = 0.015 mm and standard deviation σ = 0.754 mm.

D. Precise quantification of combined uncertainties

The uncertainty of the vertical position estimates has been analyzed qualitatively and considered for some simple test scenarios. Precise quantification of the uncertainty, to provide accurate error bounds and confidence intervals, for example, can only be achieved through a rigorous test set-up that simultaneously applies the stereo-matching algorithm and the calibrated 3-D reconstruction equations to a representative and independently measured vocal fold-like geometry. Either an excised or a synthetic larynx, under representative lighting conditions, would provide a subject that would realistically assess the performance of the stereo-matching algorithm. Comparing the obtained 3-D reconstruction to an accurate mapping of the test subject would elucidate the behavior of the combined uncertainty.

V. RESULTS AND DISCUSSION

The stereo-matching and reconstruction algorithm, as described in Sec. III, is applied to the high-speed stereo endoscope data captured from the three subjects presented in Sec. II B. Reconstructions of the vertical positions of the left and right vocal fold edges over one phase-averaged period are given in Fig. 11. The vertical dashed line on each subfigure indicates the approximate time of opening; that is, the first instance at which a perceptible glottal gap is observed in the images. All three subjects are male speakers, but the fundamental frequencies and the glottal width histories shown in Fig. 6 show considerable variation between the three phonatory cycles.

FIG. 11. Vertical motion of left (crosses) and right (circles) vocal fold edges of three subjects through one phase-averaged period. (a) Subject 1, (b) subject 2, and (c) subject 3. Refer to Fig. 6 for the temporal glottal width behavior of the three subjects.


FIG. 12. (Color online) Comparison of vertical motion estimation between surface vibrometry and stereo-imagery methods. The maximum and minimum vertical position estimates are indicated for both the vibrometry (ΔZ_vib) and stereo-imaging (ΔZ_stereo) methods on the coronal section through a vocal fold undergoing the expected modal voice oscillations through time. Superior and inferior directions are indicated by arrows.

Subject 1, with a strong modal voice, shows firm, complete glottal closure with the lowest open quotient and lowest fundamental frequency. In the Z motion reconstruction for subject 1, the difference between the minimum and maximum Z peaks is 12.6 mm for the left and 10.1 mm for the right vocal fold edge. The glottal width history of subject 2 shows a slightly higher open quotient and fundamental frequency, and the vocal fold Z reconstruction varies by 8.7 mm for the left and 7.9 mm for the right vocal fold. Subject 3, phonating with a falsetto-like voice, shows significantly lower amplitude vertical oscillation, with 4.2 mm for the left and 2.6 mm for the right vocal fold. These preliminary results show significant vertical motion of the vocal fold medial edge location through the oscillation cycle, especially during strong modal voice. A reduced inferior-superior vibration amplitude and increased fundamental frequency are expected in falsetto register as compared to chest register, with the vocal folds thinned and elongated under increased tension.24,36 The behavior of the vertical motion is consistent with the findings of previous in vivo, excised, and in vitro studies. The oscillation amplitude of the vertical motion is, however, an order of magnitude greater than what is typical of the literature.11,12,37 It is hypothesized that this is, at least in part, due to the subtle difference in what the two measurement methods actually measure. As described in previous sections, the algorithm presented herein takes advantage of the high-contrast glottal edge as a feature to obtain confident matching. Tracking the medial extremity of the vocal fold, as opposed to the superior surface at some particular location, includes the mucosal wave in the estimate of vertical position, as shown schematically in Fig. 12.
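The peak-to-peak excursions quoted above amount to a simple reduction of the phase-averaged edge traces; a minimal sketch, assuming the trace is a NumPy array with NaN marking phase bins without a valid reconstruction:

```python
import numpy as np

def vertical_excursion(z_edge):
    """Peak-to-peak inferior-superior excursion of one vocal fold edge
    over a phase-averaged cycle, ignoring bins with no valid estimate."""
    z_edge = np.asarray(z_edge, dtype=float)
    return np.nanmax(z_edge) - np.nanmin(z_edge)

# hypothetical phase-averaged trace in mm, with one missing bin
z_left = np.array([0.0, 3.1, 7.8, 12.6, np.nan, 9.4, 4.2, 0.6])
print(vertical_excursion(z_left))
```

Applied to the left and right traces of each subject, this reduction yields the per-fold amplitudes reported above.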
The vertical location of the medial extremity has been demonstrated to vary significantly in excised observations of the mucosal wave.38 Measuring the edge of the vocal folds in this manner is expected to add considerable vertical amplitude because the projection of the medial extremity, as viewed from above by the endoscope, will move with the propagation of the mucosal wave. The present method is capable of providing valuable information on the absolute length of the mucosal wave in vivo. One notable advantage of the stereo-endoscopy method over other imaging modalities is that tracking of the medial edge is limited by the resolution of the camera sensor and not by the resolution of a projected dot, line, or grid of dots. This gives a more continuous tracking of the medial

edge position, without having to dynamically position a laser projection at the medial edge as it changes position through the phonatory cycle. Admittedly, for this proof-of-concept work, the precision of the stereo reconstruction method is considerably less than that of prior laser methods. The Z reconstruction error, although not explicitly quantified for in vivo measurement, is of the same order as the expected oscillation amplitudes of the vocal fold in the inferior-superior direction.

VI. CLOSING REMARKS AND FUTURE WORK

The work completed herein serves as a proof-of-concept study and shows that viable information on the vertical motion of the vocal folds can be obtained from in vivo high-speed stereo-endoscope imagery obtained in a clinical setting. It has been demonstrated that the results are sensitive to calibration error in the 3-D reconstruction as well as to shortcomings in the stereo-matching algorithm; analysis of the error behavior suggests that improved image resolution and quality would help to ameliorate these errors and would improve the results of the presented methodology. For this method to have clinical viability, more detailed in vitro calibration studies are necessary. Accurate 3-D surface measurement of a known vocal fold geometry, either synthetic or excised, compared to stereo imagery with representative scene qualities and apparatus, would allow for much more accurate quantification of the various error sources and their effect on the final reconstruction result. The methodology, concepts, and results presented suggest that further study in this regard is warranted, with the goal of extending this work into a potentially valuable clinical tool.

ACKNOWLEDGMENTS

This work was partially supported by the CGS-MSFSS exchange program of the Canadian Natural Sciences and Engineering Research Council (NSERC) and by Grants-in-Aid for Scientific Research (No. 25540074, No. 23300071, and No. 25861530) from the Japan Society for the Promotion of Science (JSPS).

1. R. L. Miller, "Nature of the vocal cord wave," J. Acoust. Soc. Am. 31, 667–677 (1959).
2. T. Baer, A. Löfqvist, and N. S. McGarr, "Laryngeal vibrations: A comparison between high-speed filming and glottographic techniques," J. Acoust. Soc. Am. 73(4), 1304–1308 (1983).
3. D. M. Bless, M. Hirano, and R. J. Feder, "Videostroboscopic evaluation of the larynx," Ear Nose Throat J. 66(7), 289–296 (1987).
4. H. Hirose, "High-speed digital imaging of vocal fold vibration," Acta Otolaryngol. Suppl. 105(s458), 151–153 (1988).
5. I. R. Titze, The Myoelastic Aerodynamic Theory of Phonation (National Center for Voice and Speech, Iowa City, IA, 2006), 424 pp.
6. S. Adachi and J. Yu, "Two-dimensional model of vocal fold vibration for sound synthesis of voice and soprano singing," J. Acoust. Soc. Am. 117(5), 3213–3224 (2005).
7. D. A. Berry, D. W. Montequin, and N. Tayama, "High-speed digital imaging of the medial surface of the vocal folds," J. Acoust. Soc. Am. 110, 2539–2547 (2001).
8. M. Döllinger, D. A. Berry, and G. S. Berke, "Medial surface dynamics of an in vivo canine vocal fold during phonation," J. Acoust. Soc. Am. 117, 3174–3183 (2005).
9. S. Tang, Y. Zhang, X. Qin, S. Wang, and M. Wan, "Measuring body layer vibration of vocal folds by high-frame-rate ultrasound synchronized with a modified electroglottograph," J. Acoust. Soc. Am. 134(1), 528–538 (2013).
10. S. Saito, H. Fukuda, S. Kitahira, Y. Isogai, T. Tsuzuki, H. Muta, E. Takayama, T. Fujika, N. Kokawa, and K. Makino, "Pellet tracking in the vocal fold while phonating: Experimental study using canine larynges with muscle activity," in Vocal Fold Physiology, edited by I. R. Titze and R. C. Scherer (Denver Center for the Performing Arts, Denver, CO, 1985), pp. 169–182.
11. H. Larsson and S. Hertegård, "Calibration of high-speed imaging by laser triangulation," Logoped. Phoniatr. Vocol. 29, 154–161 (2004).
12. A. Chan and L. Mongeau, "Vocal fold vibration measurements using laser Doppler vibrometry," J. Acoust. Soc. Am. 133(3), 1667–1676 (2013).
13. M. Schuster, J. Lohscheller, P. Kummer, U. Eysholdt, and U. Hoppe, "Laser projection in high-speed glottography for high-precision measurements of laryngeal dimensions and dynamics," Eur. Arch. Otorhinolaryngol. 262(6), 477–481 (2005).
14. T. Wurzbacher, I. Voigt, R. Schwarz, M. Döllinger, U. Hoppe, J. Penne, U. Eysholdt, and J. Lohscheller, "Calibration of laryngeal endoscopic high-speed image sequences by an automated detection of parallel laser line projections," Med. Image Anal. 12(3), 300–317 (2008).
15. N. A. George, F. F. M. Mul, Q. Qiu, G. Rakhorst, and H. K. Schutte, "Depth-kymography: High-speed calibrated 3D imaging of human vocal fold vibration dynamics," Phys. Med. Biol. 53, 2667–2675 (2008).
16. G. Luegmair, S. Kniesburges, M. Zimmermann, A. Sutor, U. Eysholdt, and M. Döllinger, "Optical reconstruction of high-speed surface dynamics in an uncontrollable environment," IEEE Trans. Med. Imag. 29(12), 1979–1991 (2010).
17. L. Yu, G. Liu, M. Rubinstein, A. Saidi, B. Wong, and Z. Chen, "Office-based dynamic imaging of vocal cords in awake patients with swept-source optical coherence tomography," J. Biomed. Opt. 14(6), 064020 (2009).
18. E. Chang, J. Kobler, and S. Yun, "Triggered optical coherence tomography for capturing rapid periodic motion," Sci. Rep. 1, 48 (2011).
19. M. Sawashima and S. Miyazaki, "Stereo-fiberscopic measurement of the larynx: A preliminary experiment by use of ordinary laryngeal fiberscopes," Ann. Bull. RILP 8, 7–10 (1974).
20. O. Fujimura, T. Baer, and S. Niimi, "A stereo-fiberscope with a magnetic interlens bridge for laryngeal observation," J. Acoust. Soc. Am. 65, 478–480 (1979).
21. M. Sawashima, H. Hirose, K. Honda, H. Yoshioka, S. R. Hibi, N. Kawase, and M. Yamada, "Stereoendoscopic measurement of the laryngeal structure," in Vocal Fold Physiology: Contemporary Research and Clinical Issues, edited by D. M. Bless and J. H. Abbs (College-Hill, San Diego, 1983), pp. 264–276.
22. I. T. Tokuda, M. Iwawaki, K.-I. Sakakibara, H. Imagawa, T. Nito, T. Yamasoba, and N. Tayama, "Reconstructing three-dimensional vocal fold movement via stereo matching," Acoust. Sci. Tech. 34, 374–377 (2013).
23. S. Thomson, L. Mongeau, and S. Frankel, "Aerodynamic transfer of energy to the vocal folds," J. Acoust. Soc. Am. 118(3), 1689–1700 (2005).
24. N. Henrich, C. d'Alessandro, B. Doval, and M. Castellengo, "Glottal open quotient in singing: Measurements and correlation with laryngeal mechanisms, vocal intensity, and fundamental frequency," J. Acoust. Soc. Am. 117, 1417–1430 (2005).
25. R. Szeliski, Computer Vision: Algorithms and Applications (Springer, London, 2011), pp. 181–227, 467–539.
26. D. H. Ballard and C. M. Brown, Computer Vision (Prentice Hall, Englewood Cliffs, NJ, 1982), pp. 1–382.
27. D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," Int. J. Comput. Vision 47, 7–42 (2002).
28. M. D. Levine, D. A. O'Handley, and G. M. Yagi, "Computer determination of depth maps," Comput. Graph. Image Process. 2, 131–150 (1973).
29. T. Kanade and M. Okutomi, "A stereo matching algorithm with an adaptive window: Theory and experiment," IEEE Trans. Pattern Anal. Machine Intell. 16(9), 920–932 (1994).
30. J. Wendler, "Stroboscopy," J. Voice 6(2), 149–154 (1992).
31. M. Hirano and D. M. Bless, Videostroboscopic Examination of the Larynx (Singular, San Diego, CA, 1993).
32. J. Švec and H. Schutte, "Videokymography: High-speed line scanning of vocal fold vibration," J. Voice 10, 201–205 (1996).
33. I. R. Titze, "The physics of small-amplitude oscillation of the vocal folds," J. Acoust. Soc. Am. 83, 1536–1552 (1988).
34. R. Mittal, B. D. Erath, and M. W. Plesniak, "Fluid dynamics of human phonation and speech," Ann. Rev. Fluid Mech. 45, 437–467 (2013).
35. Joint Committee for Guides in Metrology, "Evaluation of Measurement Data—Guide to the Expression of Uncertainty in Measurement," JCGM Vol. 100 (2008).
36. H. Hollien, "On vocal registers," J. Phonetics 2, 125–143 (1974).
37. A. Boessenecker, D. A. Berry, J. Lohscheller, U. Eysholdt, and M. Doellinger, "Mucosal wave properties of a human vocal fold," Acta Acust. Acust. 93, 815–823 (2007).
38. M. Doellinger and D. A. Berry, "Visualization and quantification of the medial surface dynamics of an excised human vocal fold during phonation," J. Voice 20, 401–413 (2006).
