Comparison of oral surgery task performance in a virtual reality surgical simulator and an animal model using objective measures

Ioanna Ioannou¹,², Edmund Kazmierczak², Linda Stern²

Abstract— The use of virtual reality (VR) simulation for surgical training has attracted much interest in recent years. Despite its increasing popularity and usage, little work has been carried out on the use of automated objective measures to quantify the extent to which performance in a simulator resembles performance in the operating theatre, or on the effects of simulator training on real-world performance. To this end, we present a study exploring the effects of VR training on the performance of dentistry students learning a novel oral surgery task. We compare the performance of trainees in a VR simulator and in a physical setting involving ovine jaws, using a range of automated metrics derived by motion analysis. Our results suggest that simulator training improved the motion economy of trainees without adverse effects on task outcome. Comparison of surgical technique on the simulator with the ovine setting indicates that simulator technique is similar, but not identical, to real-world technique.

¹Department of Computing and Information Systems, University of Melbourne, Victoria 3010, Australia ([email protected], [email protected], [email protected])
²Department of Otolaryngology, University of Melbourne, Victoria 3010, Australia

978-1-4244-9270-1/15/$31.00 ©2015 IEEE

I. INTRODUCTION

Traditional surgical skill training in dentistry and oral surgery involves practice on manikins, animals, extracted teeth, or live patients under expert supervision. This approach suffers from limitations such as lack of case variation and standardisation, lack of realism (in the case of manikins and animals), limited availability of expert supervision, and subjective assessment of surgical skill [3], [9]. These limitations have motivated the development of a variety of virtual reality (VR) simulators, which provide repeatable practice of surgical tasks on virtual patients with varied anatomies and pathologies [3]. The body of evidence thus far suggests that simulation-based training is at least as effective as traditional clinical education, though further validation is necessary in many surgical fields [7].

Despite the increasing availability and usage of a variety of VR simulators for dentistry and oral surgery [3], surprisingly little work has been conducted to evaluate whether the skills gained in these virtual environments transfer to real-world situations. In the few studies of skill transfer that have been conducted, performance was evaluated using subjective measures such as trainee and instructor surveys [1]. The granularity and objectivity of human assessment are known to be limited [9], and the exact relationship between expert assessment and objective measures of performance has not been well studied [10].

The need for objective measures of surgical performance has been recognised for some time. It has been the focus of much work in other surgical fields, such as laparoscopic surgery, where a range of objective performance metrics have already been established and validated [8]. These measures include time on task, path length, number of motions, and motion smoothness. In the field of dentistry and oral surgery, the effort to establish such metrics is still in its infancy. Rhienmora et al. and Suebnukarn et al. introduced the use of objective assessment based on automated measures collected within a VR simulator [9], [11]. However, their work did not compare simulator metrics to their real-world counterparts to determine the extent to which simulated surgical tasks resemble real-world surgery.

An important component of effective training simulations is content validity. If the objective of a simulator is to teach psycho-motor surgical skills, but the technique used within the simulator differs significantly from the technique used in the operating room, we cannot assume that trainees will acquire the correct skills. Experienced surgeons frequently comment that drilling in VR "feels" different, but little work has been done to understand how it differs [4].

This paper investigates how the practice of an oral surgery task within a VR simulator affects real-world skills, and how objective performance measures within the simulator relate to corresponding measures in a real-world setting. This approach constitutes a more objective validation of simulator skill transfer and provides detailed information that can be used to improve the simulation by identifying discrepancies between the virtual and physical environments.

II. METHODS

Following approval by the University of Melbourne Human Research Ethics Committee, fourteen fourth-year dentistry students with little or no previous bone drilling experience were recruited on a voluntary basis. The students were randomly assigned to the simulator group (nS = 8) or the control group (nC = 6).
A. Surgical task

The drilling task chosen for this study required the removal of bone to expose the root of a tooth without damaging the root itself. This task is a common part of training for wisdom tooth extraction surgery, where the root of the wisdom tooth is exposed to aid the extraction process.

B. Platform

In previous work we studied the psycho-motor cues and drilling techniques used by novices and experts in carrying out the surgical task described above [5], [6]. These studies informed the development of a VR simulator for the practice of this task, using the open-source Forssim library as a base [2]. The system presents trainees with a 3D jaw model viewed through active stereo glasses, and uses a Sensable Phantom 1.5 High Force haptic device to simulate the touching and drilling of jaw and teeth. The Forssim base was modified with a new jaw model featuring higher resolution and improved fidelity. Additionally, the system was modified to record a range of performance measures, such as time spent on task, path length, total number of drilling strokes, and stroke characteristics such as distance, speed, and shape. Real-time feedback was provided by means of a comparison to a pre-recorded expert performance. A second display showed the percentage of expert voxels (3D pixels) removed by the trainee and the amount of unnecessary voxels removed (figure 1). Trainees could also activate an overlay of the expert performance over their own, showing areas that they had missed.

Fig. 1. A partial screenshot is shown in the left panel, and a sample of the feedback provided to the trainee is shown in the right panel.
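The voxel-based feedback described above amounts to set arithmetic over removed-voxel sets. The sketch below is an illustrative reconstruction rather than the simulator's actual code; the function names and the set-of-indices representation are our assumptions, while the 90% removal target and 30-voxel damage limit are the benchmarks stated in the study.

```python
def voxel_feedback(trainee_removed, expert_removed, tooth_voxels):
    """Compare a trainee's removed voxels against a pre-recorded expert.

    Each argument is a set of (x, y, z) voxel indices. Returns the percentage
    of expert voxels removed, the number of unnecessary voxels removed, and
    the number of tooth voxels damaged.
    """
    overlap = trainee_removed & expert_removed
    pct_expert_removed = 100.0 * len(overlap) / len(expert_removed)
    unnecessary = trainee_removed - expert_removed  # removed outside the expert region
    tooth_damage = trainee_removed & tooth_voxels   # drilled into the tooth itself
    return pct_expert_removed, len(unnecessary), len(tooth_damage)


def reached_benchmark(pct_expert_removed, tooth_damage_voxels):
    """Training target used in the study: >= 90% of expert material, < 30 damage voxels."""
    return pct_expert_removed >= 90.0 and tooth_damage_voxels < 30
```

In this form the three return values correspond directly to the three simulator outcome measures used later for comparison with the expert assessments.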

C. Experiment procedure

Physical drilling was carried out on ovine jaws, which are a common dental training adjunct in Australia and are considered a good approximation of human teeth and bone in terms of material hardness. All participants watched an instructional video demonstrating the desired technique and outcome. They then carried out the pre-test, which involved three repetitions of the task on separate teeth of an ovine jaw. To account for anatomical differences between the lingual and facial sides of the jaw which may affect performance, participants were randomised to drill teeth on either the lingual or the facial side during the pre-test, and vice versa during the post-test. Participants' performances were recorded using an electromagnetic motion tracking system (Ascension TrakSTAR), with sensors attached to the drill and jawbone.

Following the pre-test, the control group performed the post-test right away, which involved another three repetitions of the task on the opposite side of the jaw. The simulator group was instead given training on the simulator. Simulator training began with an introductory video and a short familiarisation period. Following familiarisation, participants were instructed to repeat the drilling task until they were able to remove at least 90% of the bone material removed by an expert, while inflicting only minimal (< 30 voxels) damage to the target tooth. This benchmark was based on past expert performances on the simulator. The actions of participants (motion, force, voxels drilled) were recorded by the simulator software. Once participants reached the performance target, they performed the post-test on ovine jaws in the same manner as the control group. All participants completed a short questionnaire with demographic information.

The outcome of the pre-test and post-test tasks was independently assessed by two expert oral surgeons blinded to the identity and grouping of participants. The assessors evaluated the drilled jaw specimens using the checklist shown in table I. This checklist was designed by the evaluators themselves and reflected the criteria by which they routinely assess students when teaching this task. On the simulator, the outcome of each task was assessed using voxel-based comparisons, such as the degree of overlap between participants' performance and the expert recording, the amount of damage to the tooth, and the amount of unnecessary bone removed.

TABLE I
OVINE OUTCOME ASSESSMENT CHECKLIST

Criterion                      Responses and point values
a1: Sufficient bone removed    Too little=0, Enough=1, Too much=0
a2: Tooth damage               None=2, Light=1, Severe=0
a3: Surrounding damage         Yes=0, No=1
a4: Bone left on tooth root    Yes=0, No=1
CA: Combined score             CA = a1 + a2 + a3 + a4
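The checklist in table I translates directly into a scoring function. The point mappings below are those given in the table, and the aggregation follows the description in section III (two assessors' combined scores summed per trial, then over the three trials of each test); the identifiers themselves are our own.

```python
# Point values from table I (ovine outcome assessment checklist).
POINTS = {
    "bone_removed":       {"too little": 0, "enough": 1, "too much": 0},  # a1
    "tooth_damage":       {"none": 2, "light": 1, "severe": 0},           # a2
    "surrounding_damage": {"yes": 0, "no": 1},                            # a3
    "bone_left_on_root":  {"yes": 0, "no": 1},                            # a4
}


def combined_score(responses):
    """CA = a1 + a2 + a3 + a4 for one assessor's checklist on one trial."""
    return sum(POINTS[criterion][answer.lower()]
               for criterion, answer in responses.items())


def participant_score(trials):
    """Sum both assessors' combined scores over all trials of a test.

    `trials` is a list of trials, each a list of per-assessor response dicts.
    """
    return sum(combined_score(r) for trial in trials for r in trial)
```

With two assessors and three trials, the maximum score per test is 30, which is consistent with the combined assessment scores reported in table III.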

III. RESULTS

Questionnaire data were analysed to determine whether the study cohort was balanced in terms of prior experience. The results shown in table II suggest that the groups had a similar starting skill level.


TABLE II
PARTICIPANT DEMOGRAPHICS

                                     Control        Simulator
Age                                  23.0 (3.5)     22.4 (1.3)
Previous bone drilling experience    Y:33% N:66%    Y:25% N:75%
  # of times                         1.0 (1.1)      1.0 (0.9)
Previous task experience             Y:66% N:33%    Y:62% N:38%
  # of times                         0.3 (0.5)      0.6 (1.1)
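The between-group p-values in table III come from Mann-Whitney U tests on the pre-to-post changes. A minimal pure-Python sketch of the test is given below; the normal-approximation p-value (without continuity or tie corrections) is an illustrative simplification of what a statistics package computes, and all names are ours.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic and two-sided p-value via the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)                   # u <= n1*n2/2, so z <= 0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - n1 * n2 / 2) / sigma
    p = 2 * 0.5 * (1 + math.erf(z / math.sqrt(2)))  # two-sided: doubled lower tail
    return u, min(p, 1.0)
```

For samples as small as those in this study an exact test would normally be preferred; the approximation is shown only to make the statistic concrete.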

TABLE III
BETWEEN-GROUP COMPARISON OF PRE-TEST AND POST-TEST PERFORMANCE MEASURES. P-VALUES REPRESENT THE RESULTS OF MANN-WHITNEY U TESTS COMPARING CHANGE BETWEEN GROUPS

Metric                       Sim pre-test   Sim post-test   Control pre-test   Control post-test   Sim ∆          Control ∆      ∆ p-value
Total time (s)               84.0 (20.2)    62.3 (13.0)     89.0 (36.7)        73.2 (26.9)         -21.7 (20.6)   -15.9 (15.4)   0.302
Drilling time (s)            66.2 (22.3)    55.7 (15.1)     70.2 (37.6)        57.2 (21.4)         -10.5 (21.0)   -13.1 (18.9)   0.897
Number of burr lifts         4.3 (2.2)      2.6 (0.9)       4.9 (2.2)          2.8 (1.6)           -1.8 (2.5)     -2.1 (1.5)     0.518
Total strokes                96 (53)        80 (26)         129 (91)           117 (58)            -16 (45)       -12 (44)       0.897
Mean stroke duration (s)     0.8 (0.2)      0.7 (0.1)       0.6 (0.1)          0.5 (0.1)           -0.1 (0.2)     -0.1 (0.1)     0.897
Mean stroke distance (mm)    3.7 (1.1)      4.0 (1.1)       4.0 (0.8)          3.9 (0.9)           0.2 (0.9)      -0.1 (0.6)     0.699
Mean stroke speed (mm/s)     4.3 (1.3)      4.4 (1.2)       5.4 (0.8)          6.3 (0.9)           0.1 (0.7)      0.8 (0.5)      . 0.053
% straight strokes           29.0 (7.2)     24.2 (6.5)      31.8 (7.9)         32.0 (5.7)          -4.7 (5.9)     0.1 (6.7)      0.197
% round strokes              17.9 (8.1)     28.1 (9.1)      17.1 (5.9)         15.7 (5.5)          10.2 (8.9)     -1.4 (7.4)     * 0.014
Drilling path length (mm)    335 (140)      288 (69)        494 (354)          441 (216)           -45 (131)      -53 (145)      0.897
Total path length (mm)       962 (574)      458 (144)       1056 (579)         1045 (774)          -504 (557)     -12 (284)      * 0.039
Combined assessment score    18.5 (5.8)     19.2 (3.9)      20.7 (2.4)         20.2 (4.8)          0.8 (6.7)      -0.5 (3.8)     0.476

A. Effects of simulator training

Motion data were analysed to derive a range of technique metrics. These include standard measures such as time on task and burr tip path length, in addition to more detailed measures of drilling stroke characteristics, such as length and shape. To calculate these technique metrics, motion data were processed to filter out noise, adjust for variations in signal quality, and annotate non-drilling periods. Finally, a stroke detection algorithm was used to segment the trajectory into individual drilling strokes [12]. Strokes with displacement/distance ≥ 0.85 were classified as straight, and strokes with σ(centroidDistance) ≤ 0.45 as round. Each metric was averaged across the three pre-test and three post-test trials respectively.

For outcome assessment, inter-rater agreement across all assessment criteria was estimated using Spearman correlation and found to be 0.47. The numerical value of each criterion was summed into a combined score as shown in table I. The two assessors' combined scores were summed for each trial, and pre-test and post-test trial scores were added to form a single pre-test and post-test score per participant.

To determine whether simulator practice effected significant changes in the performance of the simulator group compared to the control group, the change in each metric from the pre-test to the post-test was compared across the two groups using a Mann-Whitney U test. The changes, rather than the test measures themselves, were chosen as the basis for comparison to account for differences in starting skill level. Table III shows the means and standard deviations for each group, along with the observed change from pre-test to post-test and the Mann-Whitney U test p-value. Many of the metrics feature wide variability between participants, which hinders our ability to draw conclusions from this sample. However, the results indicate a statistically significant decrease (p=0.039) in the Total path length of the simulator group compared to the control group. Other measures of efficiency, such as Total time and Total number of strokes, show larger changes in the simulator group, but these are not statistically significant. The average combined assessment score of simulator participants increased slightly while that of the control participants decreased slightly, but the difference was not statistically significant.
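The stroke classification described above can be illustrated with a simplified sketch. We assume each stroke is given as a list of 3D burr-tip positions; the thresholds (0.85 and 0.45) are those quoted in the text, but the input format, function names, and the normalisation of the centroid-distance spread are our assumptions, not a reproduction of the algorithm in [12].

```python
import math


def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def stroke_metrics(points):
    """Path length, displacement and normalised centroid spread of one stroke.

    `points` is a list of (x, y, z) burr-tip positions; a simplified sketch
    of the stroke analysis in [12], not the study's actual implementation.
    """
    path = sum(dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    displacement = dist(points[0], points[-1])
    centroid = tuple(sum(c) / len(points) for c in zip(*points))
    radii = [dist(p, centroid) for p in points]
    mean_r = sum(radii) / len(radii)
    sigma_r = math.sqrt(sum((r - mean_r) ** 2 for r in radii) / len(radii))
    spread = sigma_r / mean_r if mean_r else 0.0  # spread relative to mean radius
    return path, displacement, spread


def classify_stroke(points, straight_thresh=0.85, round_thresh=0.45):
    """Label a stroke using the thresholds quoted in the text."""
    path, displacement, spread = stroke_metrics(points)
    if path > 0 and displacement / path >= straight_thresh:
        return "straight"
    if spread <= round_thresh:
        return "round"
    return "other"
```

Summing `path` over all strokes and inter-stroke segments yields the path-length metrics; counting labels per trial yields the % straight and % round stroke measures.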

TABLE IV
WITHIN-GROUP COMPARISON OF OVINE JAW AND SIMULATOR MEASURES FOR THE SIMULATOR GROUP USING PAIRED WILCOXON RANK SUM TESTS AND SPEARMAN CORRELATIONS

Metric                       Simulator     Ovine jaws    p-value    Corr
Total time (s)               149 (25)      62 (13)       * 0.012    0.62
Drilling time (s)            64 (21)       56 (15)       0.208      0.40
Number of burr lifts         30 (15)       3 (1)         * 0.012    -0.28
Total strokes                65 (35)       80 (26)       0.123      0.48
Mean stroke duration (s)     1.1 (0.3)     0.7 (0.1)     * 0.036    -0.05
Mean stroke distance (mm)    3.7 (0.9)     4.0 (1.1)     0.484      0.17
Mean stroke speed (mm/s)     3.6 (1.4)     4.4 (1.2)     0.161      0.38
% straight strokes           48.0 (14.7)   24.2 (6.5)    * 0.017    0.38
% round strokes              13.6 (9.2)    28.1 (9.1)    * 0.017    0.29
Drilling path length (mm)    249 (202)     288 (69)      0.208      0.43
Total path length (mm)       1187 (492)    458 (144)     * 0.012    0.74
cor(ovine combined score, simulator % of expert removed)            -0.46
cor(ovine combined score, simulator tooth damage)                   -0.46
cor(ovine combined score, simulator surrounding damage)             -0.12

In drilling stroke characteristics, there was a significant increase in the % of round strokes employed by the simulator group (p=0.014) and a slight increase in the stroke speed of the control group (p=0.053).

B. Relationship between simulator and ovine measures

Measures of technique within the simulator were calculated using the same algorithms that were used to compute the corresponding ovine jaw measures, enabling direct comparison of the two modalities. To investigate the extent to which performance in the simulator resembled performance on ovine jaws, we used paired Wilcoxon rank sum tests to compare each ovine post-test metric with the corresponding simulator metric, represented as the average of the last three simulator trials (table IV). Spearman correlations between these measures were also calculated, along with the correlation between the expert assessment score and the three simulator outcome measures.

Table IV shows that trainees took more time to complete the task on the simulator compared to ovine jaws (p=0.012), with a longer Total path length (p=0.012) and more interruptions (Burr lifts, p=0.012). Despite large differences in absolute values, Total time and Total path length were fairly well correlated (0.62 and 0.74 respectively). Measures characterising the active drilling part of the task such as


Drilling time, Drilling path length and Total strokes were not significantly different and showed moderate correlation (0.40, 0.43 and 0.48 respectively). Some stroke characteristics were similar in both environments while others differed. The biggest differences were in stroke shape measures, where participants utilised significantly more straight strokes in the simulator (p=0.017) and significantly more round strokes on ovine jaws (p=0.017). Stroke duration and distance were not correlated across the two modalities, while stroke speed and stroke shape showed weak to moderate correlation. Correlations between assessment scores on ovine jaws and outcome metrics on the simulator were mixed. More tooth damage in the simulator was moderately associated with a lower assessment score on ovine jaws, as expected. However, the % of expert voxels removed in the simulator was negatively correlated with ovine assessment score, while damage to surrounding voxels was not correlated.

IV. DISCUSSION

A significant decrease in the path length of the simulator group compared to the control group indicates that simulator training may have improved the motion economy of trainees by reducing unnecessary movements. This is a positive outcome, as past work in this area has shown that experienced dentists utilise shorter path lengths compared to novices [9]. The control group employed slightly faster drilling strokes, but the increased speed did not result in improved efficiency or outcome. Faster movement may have been a result of performing all six repetitions of the drilling task in quick succession. Another difference between the groups was the use of more round strokes by the simulator group. Our past work [5] suggests that experts use long semi-circular strokes, but stroke roundness alone cannot determine whether this was the technique employed by the trainees in this study.
Measures such as the angle between the burr and the tooth surface could be combined with stroke shape and distance to detect this technique in future studies. The two groups did not differ significantly in terms of outcome score, which suggests that simulator practice improved economy of motion without adversely affecting task outcome. A lack of clear improvement in outcome assessments is consistent with previous evaluations of dental simulators [1]. Further work is required to study the effects of VR training using more objective measures of performance, over a longer period, and with larger sample sizes, to establish a more rigorous evidence base.

Our comparison of ovine and simulator metrics yielded mixed results. The drilling path length, time spent drilling and total strokes were not significantly different, suggesting that the drilling task itself was similar in both settings. Metrics of efficiency were worse on the simulator, but this was expected due to participants pausing frequently to evaluate their performance against the built-in performance feedback. The majority of technique metrics showed moderate to strong correlation, indicating that individual participants performed similarly in both environments with respect to those metrics.
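The correlations discussed here are plain Spearman rank correlations. For reference, a minimal version, using the classic rho = 1 − 6Σd²/(n(n² − 1)) identity (valid only when there are no tied values; the names are ours), might look like:

```python
def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Illustrative sketch only; this identity assumes no tied values
    (a production implementation would use average ranks instead).
    """
    n = len(x)
    ranks = lambda v: [sorted(v).index(e) + 1 for e in v]
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Applied to a participant's per-metric values in the two modalities, this is the kind of statistic behind the Corr column of table IV.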

The most notable difference between the two environments was the use of more straight strokes on the simulator and more round strokes on ovine jaws. This difference is consistent with a previous study comparing stroke technique on a temporal bone simulator and cadavers [4]. The simulator in that study utilised a very similar drilling algorithm to the simulator in this study. A possible explanation is that limitations in the fidelity of the drilling algorithm within the simulator enable the use of drilling techniques that may be inefficient when applied in the real world. These limitations should be identified and addressed in future versions of the simulator.

Finally, further work is required to establish the validity of objective outcome measures within VR simulators. Our results showed that it is difficult to identify clear relationships between automated simulator outcome measures and expert assessments in the physical world. A better understanding of these relationships is necessary if simulators are to be used for competency assessment. Objective metrics such as those presented here could be utilised in many areas of surgical training to gain a better understanding of the effects of training interventions and the relationship between virtual and real-world performance.

ACKNOWLEDGMENT

Michael McCullough and Gregor Kennedy from the University of Melbourne assisted with the conduct and analysis of this study. Jonas Forsslund provided technical guidance.

REFERENCES

[1] M. M. Bakr, W. Massey, and H. Alexander. Evaluation of Simodont Haptic 3D virtual reality dental training simulator. Int J Dent Clinics, 5(4), 2013.
[2] J. Forsslund, E.-L. Sallnäs, and K. J. Palmerius. A user-centered designed FOSS implementation of bone surgery simulations. World Haptics Conf, 391–392, 2009.
[3] R. Gottlieb, J. M. Vervoorn, and J. Buchanan. Simulation in Dentistry and Oral Health. In The Comprehensive Textbook of Healthcare Simulation, 329–340. Springer New York, 2013.
[4] I. Ioannou, A. Avery, Y. Zhou et al. The effect of fidelity: How expert behavior changes in a virtual reality environment. Laryngoscope, 124(9):2144–2150, 2014.
[5] I. Ioannou, E. Kazmierczak, L. Stern et al. Towards Defining Dental Drilling Competence, Part 1: A Study of Bone Drilling Technique. J Dent Educ, 74(9):931–940, 2010.
[6] I. Ioannou, L. Stern, E. Kazmierczak et al. Towards Defining Dental Drilling Competence, Part 2: A Study of Cues and Factors in Bone Drilling. J Dent Educ, 74(9):941–950, 2010.
[7] W. C. McGaghie, S. B. Issenberg, E. R. Cohen et al. Does simulation-based medical education with deliberate practice yield better results than traditional clinical education? A meta-analytic comparative review of the evidence. Acad Med, 86(6):706–711, 2011.
[8] I. Oropesa, P. Sánchez-González, P. Lamata et al. Methods and tools for objective assessment of psychomotor skills in laparoscopic surgery. J Surg Res, 171(1):e81–e95, 2011.
[9] P. Rhienmora, P. Haddawy, S. Suebnukarn, and M. N. Dailey. Intelligent dental training simulator with objective skill assessment and feedback. Artif Intell Med, 52(2):115–121, 2011.
[10] D. Stefanidis. Optimal acquisition and assessment of proficiency on simulators in surgery. Surg Clin North Am, 90(3):475–489, 2010.
[11] S. Suebnukarn, N. Phatthanasathiankul, S. Sombatweroje et al. Process and outcome measures of expert/novice performance on a haptic virtual reality system. J Dent, 37(9):658–665, 2009.
[12] S. Wijewickrema, I. Ioannou, Y. Zhou et al. A temporal bone surgery simulator with real-time feedback for surgical training. Stud Health Technol Inform, 196:462–468, 2014.

