Original Paper Received: December 20, 2013 Accepted: November 3, 2014 Published online: March 4, 2015

Caries Res 2015;49:226–235 DOI: 10.1159/000369831

Assessing Risk Factors for Dental Caries: A Statistical Modeling Approach

Mario Trottini a, Maurizio Bossù b, Denise Corridore b, Gaetano Ierardo b, Valeria Luzzi b, Matteo Saccucci b, Antonella Polimeni b

 

 

 

 

 

 

a Department of Statistics, University of Alicante, Spain; b Department of Oral and Maxillofacial Sciences, ‘Sapienza’ University of Rome, Italy  

Key Words Caries risk assessment · Risk indicators · Zero inflation · Hurdle models · Model selection · Correction for optimism

Abstract

The problem of identifying potential determinants and predictors of dental caries is of key importance in caries research and has received considerable attention in the scientific literature. From the methodological side, a broad range of statistical models is currently available to analyze dental caries indices (DMFT, dmfs, etc.). These models have been applied in several studies to investigate the impact of different risk factors on the cumulative severity of dental caries experience. However, in most cases (i) these studies focus on a very specific subset of risk factors; and (ii) in the statistical modeling only a few candidate models are considered and model selection is at best only marginally addressed. As a result, our understanding of the robustness of the statistical inferences with respect to the choice of the model is very limited; the richness of the set of statistical models available for analysis is only marginally exploited; and inferences could be biased due to the omission of potentially important confounding variables in the model’s specification. In this paper we argue that these limitations can be overcome by considering a general class of candidate models and carefully exploring the model space using standard model selection criteria and measures of global fit and predictive performance of the candidate models. Strengths and limitations of the proposed approach are illustrated with a real data set. In our illustration the model space contains more than 2.6 million models, which requires inferences to be adjusted for ‘optimism’.

© 2015 S. Karger AG, Basel

Introduction

Dental caries is one of the most prevalent chronic childhood diseases worldwide and a major problem, both from a population health perspective and for the families that have to deal with young children who suffer toothaches. Current evidence shows that dental caries is a multifactorial disease modulated in complex ways by genetic, behavioral, social and environmental factors [Petersen et al., 2005; Ditmyer et al., 2010]. Given this complex etiology, a key issue is to identify the potential determinants and predictors of dental caries in order to target appropriate public health measures to prevent and control the disease and its negative consequences. The problem has received considerable attention in the scientific literature. From the methodological side, a broad range of statistical models is currently available to analyze dental caries indices (DMFT, dmfs, etc.) [see Todem, 2012, section 2]. These models have been applied in several studies to investigate the impact of different risk factors on the cumulative severity of dental caries experience. A large number of these studies have addressed the dependence of DMFT/dmft on oral hygiene habits [Freeman et al., 1989; Stewart et al., 1991] and demographic variables [Borges et al., 2012]. Others have considered the importance of the fluoride component [Petersen et al., 2012], or the impact of risk factors specific to the oral environment of the patient [Grindefjord et al., 1993]. A substantial number of studies have also considered nutrition-related variables [Bener et al., 2013] or psychosocial determinants [Eriksen et al., 1991]. However, in most cases: (i) these studies focus on a very specific subset of these risk factors, which increases the risk of bias in the statistical analysis due to confounding variables not included in the modeling strategy; (ii) in the statistical modeling only a few candidate models are considered and model selection is at best only marginally addressed. As a result, our understanding of the robustness of the statistical inferences with respect to the choice of the model is very limited, and the richness of the set of statistical models available for analysis is only marginally exploited. In this paper we argue that these limitations can be overcome by considering a very general class of candidate models (that arise by combining the wide range of statistical models available in the literature with a suitable choice of caries determinants) and carefully exploring the model space using standard model selection criteria and a rich set of measures of global fit and predictive performance of the candidate models.

As an illustration we used data on 558 children between 2 and 9 years old from the province of Caserta, an area situated in the north of the region of Campania in southern Italy. It is worth emphasising that the main focus of this paper is the statistical modeling of the data and not the sample chosen or the results from this sample (as it is a mid-sized, localised sample of limited general interest).

Mario Trottini, Universidad de Alicante, Carretera de San Vicente del Raspeig s/n, ES–03690 San Vicente del Raspeig, Alicante (Spain), E-Mail mario.trottini@ua.es
In fact, the statistical approach presented in this paper can be applied to identify the potential determinants and predictors of dental caries in an arbitrary population that could be very different from the one considered in our illustration. For example, although our data set consists of preschool and school children, the modeling strategy that we advocate could equally be applied to a population of adolescents or adults. In our illustration, caries severity was measured via the International Caries Detection and Assessment System II (ICDAS II) as the sum of the Decayed, Missing or Filled Teeth Index in the permanent dentition (DMFT) and in the deciduous teeth (dmft). We paid special attention to the choice of the best model to address the research question, considering a total of more than 2.6 million models obtained by combining six standard classes of models for caries data (Poisson, Negative Binomial and their extensions provided by the Zero Inflated and Hurdle models) with different choices of the set of explanatory variables to be included in the model. These choices corresponded to all possible subsets of a rich set of risk factors, which includes (i) socio-demographic attributes of the child and his/her parents; (ii) habits and perceptions that are potentially relevant for caries experience; (iii) premature delivery and breastfeeding; and (iv) risk factors specific to the oral environment of the patient. The wide array of potential caries determinants included in the study allowed us to assess the impact of a large number of risk factors on DMFT while properly accounting for the effect of other confounding variables (within the limits of our study, which, being observational, is not suited to assess causation). Model selection was performed using two standard procedures: Akaike’s information criterion (AIC) and the Schwarz criterion (BIC). The relative strengths of the best AIC and BIC models were assessed using a very rich set of measures of global fit and predictive performance, with bootstrap and cross-validation as internal validation procedures to ‘correct for optimism’. The robustness of the statistical inferences with respect to the model choice was also addressed by comparing the results of the best model by model type.


Sample

In our illustration we used a sample of 558 children, aged between 2 and 9 years, from the province of Caserta, an area situated in the north of the region of Campania in southern Italy. The sample was obtained using the list of pediatricians who work in the province of Caserta and belong to the Italian Society of Preventive and Social Pediatrics (SIPPS). Under the Italian system each child born in the Caserta province was randomly assigned to a pediatrician belonging to the Public Health Service (which roughly coincides with the SIPPS). Thus, the set of children covered by SIPPS provided a good approximation of the population of children in the Caserta area. Forty-five pediatricians (about 70% of the total) agreed to participate in the study. For each pediatrician a random sample (of size between 10 and 20) of children aged between 2 and 9 was recruited for a routine dental check-up. The 558 children were examined at the pediatricians’ practices in the period between June 2009 and June 2010. The research was approved in February 2009 by the Department of Oral and Maxillofacial Sciences Board of the Sapienza University of Rome (protocol number 135/2009). Permission for clinical examination of the children was obtained by written informed consent and consent to the processing of personal data, both signed by the parents of the patients.


Material and Methods

Variables Definition and Variables Recoding

The risk factors considered in the study comprise¹: (i) sociodemographic attributes of the child (age, gender, parents’ education, parents’ age); (ii) habits and perceptions that are potentially relevant for caries experience (regular use of fluoride supplements, regular use of fluoridated water and toothpaste, number of between-meal snacks, frequency of brushing, child’s motivation, breathing habits); (iii) premature delivery and breast-feeding; and (iv) risk factors specific to the oral environment of the patient (minor salivary gland function, unstimulated saliva pH, consistency and buffering capacity of saliva, plaque pH and maturity). The first three groups of variables, (i)–(iii), were obtained from a face-to-face interview with the parents of the child using a questionnaire specially designed for this study. The risk factors specific to the oral environment of the patient, instead, were the results of clinical examination at the pediatrician’s practice. Saliva and plaque variables were obtained using the Saliva-check BUFFER and the Plaque Indicator Kit tests. Caries severity was assessed using ICDAS II criteria [Ismail et al., 2007] by examiners calibrated according to the guidelines on training and calibration published by the BASCD [Pine et al., 1997]. The criterion for diagnosis of caries according to ICDAS was a score of 3 or greater. Decayed, missing and filled teeth indices in both the permanent dentition (DMFT) and the deciduous teeth (dmft) were calculated as the number of teeth with caries lesions (incipient caries not included) plus the number of teeth that had been extracted or had fillings or crowns. For each child we used as response variable the sum of the number of decayed, missing and filled teeth in the permanent and primary dentition. With a slight abuse of notation, in the rest of the paper we will denote this response variable by DMFT.
¹ In what follows the terms ‘risk factor’ and ‘explanatory variable’ will be used interchangeably.

Table 1 shows the distribution of the explanatory variables used in the study (with the corresponding levels when the variable is categorical). For continuous variables only summary statistics are reported. As shown in table 1, most of the children who participated in the study (about 92.3%) are between 4 and 8 years old. In order to simplify model specification, both the father’s and the mother’s education variables were recoded by assigning to each person a value based on the years of schooling. The variable regular use of fluoride supplements takes into account the use of fluoride tablets by the mother during pregnancy; the use of fluoride tablets or drops by the growing child; the frequency of use of fluoride toothpaste; and the application of fluoridated gel/varnish in professional dental practices. A second variable related to daily fluoride intake, regular use of fluoridated water and/or toothpaste, was categorised with three levels: ‘none’, when neither the water that the patient drinks at home nor the toothpaste used is fluoridated; ‘at least one’, when at least one of the two (water or toothpaste) is fluoridated; and ‘both’, when both are fluoridated. The number of between-meal snacks variable indicates the frequency of intake of food, often non-nutritious, between the main meals: breakfast, lunch and dinner. The priority that a patient assigns to oral health and the extent to which he/she is aware of his/her dental problems is assessed through the variable motivation, which has three levels: ‘high’, for self-motivated patients who are aware of dental problems and for whom dental health has a high priority; ‘medium’, for patients who are aware of dental problems but need to increase their motivation for oral health; and ‘low’, for unmotivated patients with low awareness of dental problems and for whom oral health has a low priority. To represent in a model a categorical risk factor with k levels, we use (k–1) binary variables. The first level of the categorical variable (according to the codification in table 1) is taken as the baseline category, and for each of the remaining (k–1) levels we define a binary variable whose coefficient measures the differential effect of that level with respect to the baseline category. In table 1 we also report in parentheses the actual sample size (n) by variable.
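The (k–1) binary-variable coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dummy_code` and its `prefix` argument are hypothetical, and the level names follow the motivation variable as it appears in the model tables (`motivation_medium`, `motivation_low`), with ‘high’ taken as the baseline.

```python
def dummy_code(values, levels, prefix):
    """Encode a categorical variable with k levels as k-1 binary columns.

    The first entry of `levels` is the baseline category and gets no column;
    every other level gets one 0/1 indicator named '<prefix>_<level>'.
    """
    baseline, *others = levels
    rows = []
    for value in values:
        if value not in levels:
            raise ValueError(f"unknown level: {value!r}")
        rows.append({f"{prefix}_{lvl}": int(value == lvl) for lvl in others})
    return rows


# Hypothetical example: three children with high, medium and low motivation.
coded = dummy_code(["high", "medium", "low"],
                   levels=["high", "medium", "low"],
                   prefix="motivation")
for row in coded:
    print(row)
```

A child in the baseline category has all indicators equal to zero, so each coefficient is read as a contrast with that baseline.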
As shown in the table, the number of missing values is quite heterogeneous across the risk factors considered in the study. We have complete information for age, gender and oral breathing, while the minimum sample size is for plaque maturity (n = 522).

Statistical Modeling

As observed in the recent literature on statistical modeling of caries data [see, for example, Preisser et al., 2012], as a result of improvements in the oral health of populations, distributions of caries data are increasingly characterised by a number of zero observations in excess of what is expected under traditional count data models such as the Poisson (POIS) and the Negative Binomial (NB) models. In our modeling of DMFT data, in addition to Poisson and Negative Binomial models, we consider two extensions of these models that can handle such an ‘excess of zeros’: zero-inflated models and Hurdle models [see Cameron and Trivedi, 2013 for an overview and Preisser et al., 2012 for a discussion of important aspects of the interpretation of these models]. This produces a total of six types of models: Poisson (POIS), Negative Binomial (NB), Zero Inflated Poisson (ZIP), Zero Inflated Negative Binomial (ZINB), Hurdle Poisson (HurdleP) and Hurdle Negative Binomial (HurdleNB). Each of the six types of models, for a given choice of the set of covariates, provides an assessment of the effects of risk factors on DMFT. Different models, however, generally provide different assessments. Some model selection procedure thus needs to be established. In this paper we considered two well-known model selection criteria, Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC). Both provide a trade-off between model complexity and how well the model fits the data. Each of the two criteria can be expressed as the sum of a goodness-of-fit term and a penalty for model complexity.
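To make the difference between the two ‘excess of zeros’ extensions concrete, the following minimal sketch writes out the probability mass functions of a zero-inflated Poisson and a hurdle Poisson model in their standard textbook form. It is not the estimation code used in the study: the function names and parameter values are illustrative assumptions, covariates are omitted, and the zero probability 290/459 is simply the observed share of children with DMFT = 0 in the complete-case data.

```python
import math


def poisson_pmf(y, lam):
    """Standard Poisson probability of observing the count y."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zip_pmf(y, lam, pi):
    """Zero-inflated Poisson: with probability pi the count is a structural
    zero, otherwise it is drawn from a Poisson (which can also give zero)."""
    base = (1.0 - pi) * poisson_pmf(y, lam)
    return pi + base if y == 0 else base


def hurdle_poisson_pmf(y, lam, p_zero):
    """Hurdle Poisson: zeros come only from the binary 'hurdle' part;
    positive counts follow a Poisson truncated at zero."""
    if y == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(y, lam) / (1.0 - math.exp(-lam))


lam = 2.0                # illustrative mean of the count part (assumption)
p_zero = 290 / 459       # observed share of zeros in the complete-case data
print(poisson_pmf(0, lam))                 # zero probability under plain Poisson
print(zip_pmf(0, lam, pi=0.4))             # inflated zero probability
print(hurdle_poisson_pmf(0, lam, p_zero))  # zero probability set by the hurdle
```

In the zero-inflated model a zero can arise from either component, whereas in the hurdle model the zero and positive parts are modeled separately, which is why the two models call for different interpretations of the ‘zero component’ coefficients.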


The time elapsed since the study was performed, the localised and mid-sized nature of the sample used and the differences in the age of the children involved in the study are potential flaws in our dataset. We believe, however, that despite these limitations the dataset served its purposes, which are twofold: (i) to illustrate step by step how to implement the proposed statistical modeling strategy; and (ii) to highlight the benefits and limitations of the proposed strategy. In addition, it should be noted that: (a) the problem of differences in the ages of the participants can be (at least partially) addressed by including age as an explanatory variable in the modeling strategy (the results of the statistical analysis will therefore be adjusted for the confounding effect of age); and (b) in our opinion it is reasonable to assume that the target population (children between 2 and 9 in the Caserta province) has experienced very minor changes in the last four years, and thus the time elapsed since the study was performed is probably a minor flaw compared with the localised and mid-sized nature of the data.

Table 1. Distribution/summary statistics for the explanatory variables in the study

(n = 558) 1.6 5.7 17.3 20.9 17.4 15.8 20.9 0.4
(n = 558) 53.4 46.6
(n = 545) 5.1 33.2 47.8 13.9
(n = 543) 2.8 37.3 45.7 14.2
(n = 552) 21 32 36 39 53
(n = 546) 23 35 39 42 65
(n = 537) 47.3 52.7
(n = 548) 2.2 56.6 41.2
(n = 552) 37.7 32.4 29.9

Frequency of brushing, %: 0, 1, 2, 3
Motivation, %: High, Medium, Low
Oral breathing, %: Yes, No
Minor salivary gland function, %

60 s (–); saliva_consistency_washy (–); motivation_low (+); saliva_glands_funct. >60 s (–); buffering_capacity_10–12 (–) saliva_consistency_washy (–); premature_delivery_NO (+).

ZINB
Count component: oral_breathing_NO (–); buffering_capacity_10–12 (–); Φ dispersion parameter
Zero component: age (+); num_between_snaks2 (+); motivation_medium (+); motivation_low (+); saliva_glands_funct. >60 s (–); saliva_consistency_washy (–); premature_delivery_NO (+)

HurdleP
Count component: oral_breathing_NO (–); saliva_consistency_washy (–); buffering_capacity_10–12 (–)
Zero component: age (+); num_between_snaks2 (+); motivation_medium (+); motivation_low (+); saliva_glands_funct. >60 s (–); saliva_consistency_washy (–); premature_delivery_NO (+)

HurdleNB
Count component: oral_breathing_NO (–); saliva_glands_funct. >60 s (–); saliva_consistency_washy (–); buffering_capacity_10–12 (–); Φ dispersion parameter
Zero component: age (+); num_between_snaks2 (+); motivation_medium (+); motivation_low (+); saliva_glands_funct. >60 s (–); saliva_consistency_washy (–); premature_delivery_NO (+)

Best BIC models

POIS age (+); motivation_low (+) age (+); motivation_low (+)

ZIP
Count component: motivation_low (+); oral_breathing_NO (–)
Zero component: age (+); motivation_medium (+); motivation_low (+)

ZINB
Count component: motivation_low (+); Φ dispersion parameter
Zero component: age (+); motivation_medium (+); motivation_low (+)

HurdleP
Count component: motivation_low (+); oral_breathing_NO (–)
Zero component: age (+); motivation_medium (+); motivation_low (+)

HurdleNB
Count component: motivation_low (+); Φ dispersion parameter
Zero component: age (+); motivation_medium (+); motivation_low (+)

ZI and Hurdle models. The same conclusions apply if we take into account model complexity and penalise models with a larger number of parameters. As shown in the table, in fact: (i) Negative Binomial-type models should be preferred to Poisson-type models both in terms of AIC and BIC (NB outperforms POIS; ZINB is superior to ZIP; and HurdleNB to HurdleP); and (ii) ZI and Hurdle models should be preferred to non-inflated models both in terms of AIC and BIC (ZIP and HurdleP outperform POIS; and ZINB and HurdleNB are superior to NB). In terms of model fit, the best and second-best AIC (BIC) models, which are the HurdleNB and ZINB models, respectively, are very similar and clearly outperform the other models. These results agree with the existing literature on statistical modeling of DMFT, where ZINB and HurdleNB models are usually preferred because of the ‘excess’ of zeros and the ‘overdispersion’ that characterise dental caries data. According to tables 2 and 3, the identification of relevant risk factors for dental caries based on the AIC and BIC criteria is robust across the different types of models considered in this study. In particular, the four best AIC models (which in decreasing order are the HurdleNB, ZINB, HurdleP and ZIP models) include the risk factors age, motivation, number of between-meal snacks, oral breathing, minor salivary gland function, saliva consistency, buffering capacity, and premature delivery, which are


of fit measures in table 2, or by evaluating the predictive performance of the estimated models in table 2 using the original data, since such an evaluation would be too ‘optimistic’: the same data would be used both to fit the model and to evaluate its performance [see Steyerberg et al., 2001]. Some validation procedure is thus needed. In particular, we used two internal validation procedures, regular bootstrap and a variant of 5-fold cross-validation, which have been shown to outperform other procedures in similar settings [such as logistic regression, as discussed in Steyerberg et al., 2001, section 2.4]. For the regular bootstrap procedure we drew with replacement B = 500 bootstrap samples of size n = 459 from our data set. Each candidate model, as estimated in the bootstrap sample, was evaluated in the bootstrap sample and in the original sample. The performance in the bootstrap sample is an estimate of the apparent performance, and the performance in the original sample is the test performance. The average of the differences between the apparent and test performance across the B samples is an estimate of the optimism, which must be subtracted from the apparent performance in table 2. In the 5-fold cross-validation, we considered a random partition of the original data into 5 subsamples (each with approximately 20% of the observations in the original data). For each candidate model, in turn, four of the five subsamples were merged and used to estimate the model, and the fifth subsample was used to evaluate the model performance. The random partition of the original data was repeated 100 times, producing a total of 500 training samples (to estimate the model) and 500 test samples (to assess the model performance). The performance of each model is defined as the average performance of the model on the 500 test samples. Summary results for the comparison of the best AIC and BIC (HurdleNB) models are shown in tables 4 and 5.
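The bootstrap correction for optimism described above can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the function `bootstrap_optimism` and its arguments are hypothetical names, and a mean-only ‘model’ scored by squared error on a small invented data set stands in for the hurdle models and the performance measures of table 4. Only the bootstrap procedure is shown; the repeated 5-fold cross-validation is not.

```python
import random


def bootstrap_optimism(data, fit, score, n_boot=500, seed=0):
    """Return (apparent, optimism, corrected) for a negatively oriented score.

    fit(sample) returns a fitted model; score(model, sample) evaluates it.
    Optimism is the average over bootstrap samples of
    (performance on the bootstrap sample) - (performance on the original data).
    """
    rng = random.Random(seed)
    n = len(data)
    apparent = score(fit(data), data)
    gaps = []
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]  # draw with replacement
        model = fit(sample)
        gaps.append(score(model, sample) - score(model, data))
    optimism = sum(gaps) / n_boot
    return apparent, optimism, apparent - optimism


def fit_mean(sample):
    # Stand-in model (assumption): predict the sample mean for every child.
    return sum(sample) / len(sample)


def squared_error(mean, sample):
    # Stand-in performance measure (assumption): mean squared error.
    return sum((y - mean) ** 2 for y in sample) / len(sample)


toy_counts = [0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 4, 7]  # invented DMFT-like counts
apparent, optimism, corrected = bootstrap_optimism(toy_counts, fit_mean, squared_error)
print(apparent, optimism, corrected)
```

For a score where smaller is better, the estimated optimism is typically negative, so subtracting it makes the corrected value less favourable than the apparent one, which is the pattern seen when moving from the apparent to the bootstrap-corrected columns.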
For each candidate model, table 4 presents the measures of goodness of fit and predictive performance adjusted for optimism using bootstrap and (the variant of) 5-fold cross-validation (which we label 100 × 20% cross-validation)². As shown in the table, the two internal validation procedures produce very similar results except for Gilthorpe’s RMSE measure and the maximised log-likelihood (this discrepancy is expected since RMSE and

² For an interpretation of the maximised log-likelihood (logLik) evaluated using cross-validation (last row, columns 3 and 6 in table 4) in terms of the Kullback-Leibler divergence between the truth and the model under consideration see Liu et al. [2012].


statistically significant (at the significance level α = 0.05) in at least one component of the model, with the same sign when the variable is statistically significant in both components (see table 3). The third best AIC model, which is a HurdleP model, contains an extra regressor, unstimulated saliva pH, which is not statistically significant. Roughly speaking, the four best AIC models suggest that, after adjusting for the confounding variable age, (low) motivation, (slow) minor salivary gland function, (low) buffering capacity, saliva consistency, number of between-meal snacks, oral breathing and (NO) premature delivery are significant risk factors for DMFT. As expected, the BIC criterion yields more parsimonious models. The four best BIC models (which again are HurdleNB, ZINB, HurdleP and ZIP, respectively) all include the risk factors age and motivation, which are statistically significant (at the significance level α = 0.05) in at least one component of the model, with the same sign when the variable is statistically significant in both components (see table 3). It should be noted that age and motivation are, in fact, statistically significant in all best AIC and BIC models in table 2. This is not surprising. As observed in the introduction, age is an important confounding variable. Its inclusion in the selected models allows us to assess the impact of the other caries determinants while removing the effect of the differences in age of the participants in the study (which, as we commented before, is one of the potential flaws of our dataset). The third and fourth best BIC models (which are HurdleP and ZIP, respectively) contain an extra regressor, oral breathing, which is statistically significant in the count component of the model. Thus, according to the models selected by BIC, there is evidence in the data that (low) motivation (and to a lesser extent oral breathing) are risk factors for DMFT.
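The reason BIC tends to select more parsimonious models than AIC can be seen from the standard definitions of the two criteria, AIC = –2 logLik + 2k and BIC = –2 logLik + k ln(n), where k is the number of estimated parameters and n the sample size. The sketch below is only illustrative: the log-likelihoods and parameter counts are hypothetical numbers chosen for the example, not values from the fitted models; only n = 459 is taken from the study.

```python
import math


def aic(loglik, k):
    """Akaike's information criterion: each parameter costs 2."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian (Schwarz) information criterion: each parameter costs ln(n)."""
    return -2.0 * loglik + k * math.log(n)


n = 459  # children with complete data
# Hypothetical candidates (assumed numbers): a rich and a sparse model.
rich = {"loglik": -550.0, "k": 20}
sparse = {"loglik": -585.0, "k": 6}

for name, m in (("rich", rich), ("sparse", sparse)):
    print(name, aic(m["loglik"], m["k"]), round(bic(m["loglik"], m["k"], n), 2))
```

With n = 459 each extra parameter costs ln(459) ≈ 6.13 under BIC against 2 under AIC, so in this invented example AIC prefers the rich model while BIC prefers the sparse one.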
As an additional model selection criterion to choose between the best AIC and BIC models we take into account their global fit (as measured by the maximised log-likelihood and Gilthorpe’s RMSE) and their predictive performance. For the latter we consider six scoring rules that provide summary measures of the predictive performance of the model. These are the logarithmic (logS), quadratic (QS), spherical (SphS), ranked-probability (RPS), Dawid-Sebastiani (DSS), and squared error (SES) scores. The six scoring rules are negatively oriented penalties that a forecaster wishes to minimise (i.e., the smaller the better). For a detailed definition of the six scoring rules and their interpretation see Czado et al. [2009]. Comparison of the best AIC and BIC models, however, cannot be addressed directly by using the goodness

Table 4. Comparison of the best AIC HurdleNB and the best BIC HurdleNB models in terms of several measures of model fit and predictive performance adjusted for optimism using bootstrapping (with B = 500 bootstrap samples) and 100 × 20% cross-validation

Columns 1–3: Best HurdleNB AIC (1 apparent; 2 bootstrap corrected; 3 100 × 20% cross-validation)
Columns 4–6: Best HurdleNB BIC (4 apparent; 5 bootstrap corrected; 6 100 × 20% cross-validation)

Measure   1          2          3          4          5          6
logS      1.207      1.279      1.291      1.274      1.294      1.296
QS        –0.506     –0.488     –0.486     –0.490     –0.485     –0.485
SphS      –0.680     –0.669     –0.668     –0.669     –0.666     –0.667
RPS       0.759      0.811      0.819      0.854      0.866      0.868
DSS       1.713      2.046      2.093      2.034      2.175      2.190
SES       3.641      3.998      4.074      4.553      4.618      4.659
RMSE      11.683     11.791     30.114     12.350     12.529     30.153
logLik    –553.869   –587.184   –118.549   –584.589   –594.086   –119.010

Table 5. Comparison of observed versus ‘bootstrap-adjusted’ predicted counts for the best (HurdleNB) AIC and BIC models

DMFT   Observed data, n   AIC, n    BIC, n
0      290                290.380   290.320
1      29                 33.240    34.820
2      42                 33.990    33.500
3      28                 28.640    27.580
4      26                 21.940    21.120
5      7                  15.940    15.570
6      9                  11.190    11.210
7      9                  7.660     7.930
8      8                  5.170     5.530
9      3                  3.460     3.800
10     2                  2.310     2.570
11     2                  1.550     1.730
12     0                  1.050     1.150
13     2                  0.720     0.760
>14    2                  1.770     1.420

log-Lik depend on the sample size of the test data, which is 459 for bootstrap and either 91 or 92 for 5-fold cross-validation). Correction for optimism, in both cases, substantially reduces the ‘distance’ in global fit and predictive performance between the best AIC and best BIC models (compare the difference between columns 1 and 4 with the corresponding differences between columns 2 and 5 or 3 and 6 in table 4). Although the best AIC model still slightly outperforms the best BIC model with respect to all measures of global fit and predictive performance considered even after adjusting for optimism, the performance of the two models is very similar. As an additional indicator of global fit after adjusting for optimism, in table 5 we compare the observed versus ‘bootstrap-adjusted’ predicted counts for the best AIC and BIC models³. Again the performance of the two models is very similar.
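For readers unfamiliar with the six scoring rules used in the comparison, the sketch below computes them for a single observation from a predictive probability mass function, using the usual definitions for count data (logarithmic, quadratic, spherical, ranked-probability, Dawid-Sebastiani and squared error scores). It is a hedged illustration only: the function name `count_scores` is hypothetical, the predictive distribution is given on a finite support 0, 1, ..., K, and in practice the scores would be averaged over all children rather than reported for one.

```python
import math


def count_scores(pmf, y):
    """Negatively oriented scores for an observed count y, given a predictive
    pmf on 0..K supplied as a list of probabilities that sum to one."""
    norm_sq = sum(p * p for p in pmf)
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    rps, cdf = 0.0, 0.0
    for k, p in enumerate(pmf):
        cdf += p
        rps += (cdf - (1.0 if y <= k else 0.0)) ** 2
    return {
        "logS": -math.log(pmf[y]),
        "QS": -2.0 * pmf[y] + norm_sq,
        "SphS": -pmf[y] / math.sqrt(norm_sq),
        "RPS": rps,
        "DSS": (y - mean) ** 2 / var + math.log(var),
        "SES": (y - mean) ** 2,
    }


# Two invented forecasts for a child whose observed count is 0.
sharp = count_scores([0.8, 0.1, 0.1], y=0)       # puts most mass on the outcome
vague = count_scores([1 / 3, 1 / 3, 1 / 3], y=0)  # spreads mass evenly
print(sharp["logS"], vague["logS"])
```

The forecast that concentrates probability on the observed outcome receives the smaller logarithmic, quadratic, spherical and ranked-probability scores, consistent with the ‘smaller is better’ orientation of the measures.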

Discussion and Conclusions

The problem of identifying potential determinants and predictors of dental caries has received considerable attention in the scientific literature. Several classes of statistical models have been proposed to address this important research question. In most cases, however, the statistical approach used to identify potential determinants and predictors of dental caries has focused on a very limited subset of risk factors and a very small set of candidate models. As a result, our understanding of the robustness of the statistical inferences with respect to the choice of the model is very limited; the richness of the set of statistical models available for analysis is only marginally exploited; and inferences could be biased due to the omission of potentially important confounding variables in the model’s specification. In this paper we argue that these limitations can be overcome by considering a very general class of candidate models and carefully exploring the model space using standard model selection criteria and a wide range of measures of global fit and predictive performance of the candidate models. We illustrated our ap-

³ Each candidate model as estimated in the bootstrap sample was used to obtain predicted counts for the original sample. This procedure was repeated 500 times (i.e., once for each of the bootstrap samples). ‘Bootstrap-adjusted’ predicted counts were then obtained by averaging the predicted counts over the 500 repetitions.


logS QS SphS RPS DSS SES RMSE logLik

Best HurdleNB BIC


sider and exponentially with the number of risk factors considered in the study, that is,

$$(\text{size of the model space}) = (\text{number of types of models}) \cdot 2^{(\text{number of risk factors})}.$$
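The growth rate implied by this formula is easy to check numerically. The sketch below is a plain arithmetic illustration with hypothetical function names; the storage estimate assumes one bit per risk factor to flag inclusion in a model and 1 terabyte = 10^12 bytes, which are assumptions made here and not details given in the text.

```python
def model_space_size(n_model_types, n_risk_factors):
    """Number of candidate models: every model type combined with every
    subset of the risk factors."""
    return n_model_types * 2 ** n_risk_factors


def storage_terabytes(n_model_types, n_risk_factors):
    """Rough size of a binary representation of the model space, assuming
    one inclusion bit per risk factor for each model (assumption)."""
    bits = model_space_size(n_model_types, n_risk_factors) * n_risk_factors
    return bits / 8 / 1e12


print(model_space_size(6, 40))              # 6 * 2**40
print(round(storage_terabytes(6, 40), 1))   # about 33 terabytes
print(model_space_size(6, 19) // model_space_size(6, 18))  # doubling per factor
```

Under these assumptions 40 risk factors give 6,597,069,766,656 models and roughly 33 terabytes, in line with the ‘more than 30 terabytes’ figure. As a side observation, the formula evaluated with six types and 18 risk factors gives 1,572,864, while ‘more than 2.6 million’ equals 10 · 2^18 = 2,621,440; the number of model variants actually enumerated may therefore have been larger than six, although this is only an inference from the arithmetic.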

In our illustration, for example, inclusion of an additional risk factor would double the size of the class of models (i.e., with 19 risk factors we would have more than 5.2 million models to search). Inclusion of 40 potential risk factors would produce $6 \cdot 2^{40} \approx 6 \cdot 10^{12}$ different models. The mere binary representation of such a model space would occupy more than 30 terabytes of memory. In addition, considering a large class of models also requires adjusting the corresponding results for ‘optimism’, which implies extra computation, as in our illustration. Modern computers, however, combined with efficient programming should make implementation of our approach feasible for most applications of interest in dental caries studies. In addition, sampling procedures can be used to perform model selection when the model space is huge, exhaustive enumeration of all the models is infeasible and inferences have to be based on a very small fraction of models visited [for a Bayesian approach to the problem see, for example, Garcia-Donato and Martinez-Beneito, 2013]. From the statistical modeling point of view, an additional problem that needs to be addressed (and which was not, however, the focus of this paper) is the missing value problem⁴. In our illustration the overall loss of children due to missing data was fairly high. Out of the 558 children who participated in the study, 459 had complete data for the 18 risk factors considered in the modeling. The likelihood-based models that we used in modeling the DMFT are known to be robust against the missing at random mechanism [Todem, 2012]. This means that, for a statistical analysis based on these models to be valid (when applied, as in our case, to incomplete data), the ‘missingness’ must depend only on the observed outcomes.
In order to check the missing at random hypothesis, we compared the group of (459) children with complete data and the group of (99) children with incomplete data with respect to DMFT index and each of the 18 risk factors used in the modeling strategy using a Chi square test for homogeneity [Fienberg, 1980]. At the significant level α = 0.05 we did not find statistically signifi4

For a general discussion of missing data mechanisms and their impact on statistical modeling see Little and Rubin [1987]. For a more specific discussion of the importance of problems and difficulties related to missing values in caries data see Todem [2012] section 4 and the references therein.

Trottini/Bossù/Corridore/Ierardo/Luzzi/ Saccucci/Polimeni

Downloaded by: UCSF Library & CKM 169.230.243.252 - 4/14/2015 11:47:44 PM

proach with a sample of children from the south of Italy. The sample used might not be of general interest (it is localised, mid-size and has additional flaws that we acknowledge in several places in the paper). We believe, however, that it serves its scopes which are twofold: (i) illustrate step by step how to implement the proposed statistical modeling strategy; and (ii) highlight the benefits and limitations of the proposed strategy. Important distinguishing features and (strengths) of our proposal are the very large size of the model space that we consider (and that exploits the richness of the statistical models and the broad range of risk factors discussed in the literature); and the combined used of model selection criteria, goodness of fit and predictive performance measures to perform model selection. These features improve existing statistical approaches to the problem of identifying determinants and predictors of dental caries in three important and interrelated ways, in particular, they (i) allow us to address the robustness of the results of our analysis with respect to the choice of the model, thus providing a more precise quantification of the uncertainty associated with the statistical inferences performed; (ii) provide a scientific framework to defend the use of a certain model as tool for inferences; (iii) avoid potential bias due to the omission of important risk factors in the specification of a model. In our illustration we consider more than 2.6 million models obtained combining six classes of models and 18 risk factors discussed in the literature. Model selection has been performed using the AIC and BIC criteria, and multiple assessments of the goodness of fit and predictive performance of the selected models. The robustness of the results with respect to the model choice has been addressed comparing the results of the best model by model type. 
According to our analysis (based on the best BIC model) low motivation (suitably adjusted for age) is a significant risk factor for DMFT. Other significant risk factors (according to the best AIC model) are slow minor salivary gland functions, low buffering capacity, saliva consistency, number of between snacks, and premature delivery. Once we correct for optimism, however, the inclusion of these additional risk factors only slightly improve the global fit and predictive performance of the model. These results are in agreement with the existing literature for caries data [see Ismail et al., 2011; Melvin, 1991; LlenaPuy, 2006; Evans et al., 2013, respectively], but for premature delivery. The main limitation of the proposed approach is its computational cost. The size of the model space increases linearly in the number of classes of models that we con-

cant differences between the two groups neither in terms of DMFT nor in terms of the 18 risk factors but for age which is observed for the whole sample. Our analysis supports the plausibility of the missing at random hypothesis and thus the validity of the inferences under the models used in our modeling strategy. Acknowledgements The authors would like to thank the pediatricians that participated in this study and offered their time and professionalism, in particular Dr. Petrazzuoli Giovanni and SIPPS’s President Dr. Di

Mauro Giuseppe. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
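As a rough check of the model-space arithmetic given in the discussion of computational cost, the sketch below evaluates (number of types of models) · 2^(number of risk factors) for the 40-factor example and estimates the storage needed to list every model. It assumes, purely for illustration, that each model is stored as one inclusion bit per risk factor; the function names are hypothetical and not part of the original analysis. It does not attempt to reproduce the counts reported for the illustration itself, which depend on how the model classes were specified.

```python
def model_space_size(n_model_types: int, n_risk_factors: int) -> int:
    """(number of model types) * 2 ** (number of risk factors)."""
    return n_model_types * 2 ** n_risk_factors


def storage_bytes(n_model_types: int, n_risk_factors: int) -> int:
    """Bytes needed to list every model.

    Assumption (illustrative only): each model is encoded as one
    inclusion bit per risk factor, with no extra bits for the model type.
    """
    total_bits = model_space_size(n_model_types, n_risk_factors) * n_risk_factors
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    n_types, n_factors = 6, 40
    print(f"models: {model_space_size(n_types, n_factors):.2e}")
    print(f"storage: {storage_bytes(n_types, n_factors) / 2**40:.1f} TiB")
    # One extra risk factor doubles the model space for any number of model types.
    assert model_space_size(n_types, 19) == 2 * model_space_size(n_types, 18)
```

With six model types and 40 risk factors this gives about 6.6 · 10^12 models and roughly 30 TiB under the stated encoding, which is in line with the figures quoted in the text.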

Disclosure Statement

For this study, there was no conflict of interest in: (1) the study design; (2) the collection, analysis and interpretation of data; (3) the writing of the report; and (4) the decision to submit the paper for publication.

References


Akaike H: A new look at the statistical identification model. IEEE Trans Automatic Control 1974;19:716–723.
Bener A, Al Darwish MS, Tewfik I, Hoffmann GF: The impact of dietary and lifestyle factors on the risk of dental caries among young children in Qatar. J Egypt Public Health Assoc 2013;88:67–73.
Borges HC, Garbín CA, Saliba O, Saliba NA, Moimaz SA: Socio-behavioral factors influence prevalence and severity of dental caries in children with primary dentition. Braz Oral Res 2012;26:564–570.
Cameron C, Trivedi A: Regression Analysis of Count Data, ed 2. Econometric Society Monograph No. 53. Cambridge University Press, 2013.
Czado C, Gneiting T, Held L: Predictive model assessment for count data. Biometrics 2009;65:1254–1261.
Ditmyer M, Dounis G, Mobley C, Schwarz E: A case-control study of determinants for high and low dental caries prevalence in Nevada youth. BMC Oral Health 2010;10:24.
Eriksen HM, Bjertness E: Concepts of health and disease and caries prediction: a literature review. Scand J Dent Res 1991;99:476–483.
Evans EW, Hayes C, Palmer CA, Bermudez OI, Cohen SA, Must A: Dietary intake and severe early childhood caries in low-income, young children. J Acad Nutr Diet 2013;113:1057–1061.
Fienberg SE: The Analysis of Cross-Classified Data, ed 2. Cambridge, MA, MIT Press, 1980.
Freeman L, Martin S, Rutenberg G, Shirejian P, Skarie M: Relationships between DEF, demographic and behavioral variables among multiracial preschool children. ASDC J Dent Child 1989;56:205–210.
Gilthorpe MS, Frydenberg M, Cheng Y, Baelum V: Modelling count data with excessive zeros: the need for class prediction in zero-inflated models and the issue of data generation in choosing between zero-inflated and generic mixture models for dental caries data. Stat Med 2009;28:3539–3553.
Grindefjord M, Dahllöf G, Ekström G, Höjer B, Modéer T: Caries prevalence in 2.5-year-old children. Caries Res 1993;27:505–510.
Garcia-Donato G, Martinez-Beneito MA: On sampling strategies in Bayesian variable selection problems with large model spaces. J Am Stat Assoc 2013;108:340–352.
Ismail AI, Sohn W, Tellez M, Amaya A, Sen A, Hasson H, Pitts NB: The International Caries Detection and Assessment System (ICDAS): an integrated system for measuring dental caries. Community Dent Oral Epidemiol 2007;35:170–178.
Ismail AI, Ondersma S, Jedele JM, Little RJ, Lepkowski JM: Evaluation of a brief tailored motivational intervention to prevent early childhood caries. Community Dent Oral Epidemiol 2011;39:433–448.
Liu H, Kronmal R, Chan KS: Semiparametric zero-inflated modeling in multi-ethnic study of atherosclerosis (MESA). Ann Appl Stat 2012;6:1236–1255.
Llena-Puy C: The role of saliva in maintaining oral health and as an aid to diagnosis. Med Oral Patol Oral Cir Bucal 2006;11:E449–E455.
Little R, Rubin D: Statistical analysis with missing data. New York, John Wiley & Sons, 1987.
Melvin JE: Saliva and dental diseases. Curr Opin Dent 1991;1:795–801.
Petersen PE, Bourgeois D, Ogawa H, Estupinan-Day S, Ndiaye C: The global burden of oral diseases and risks to oral health. Bull World Health Organ 2005;83:661–669.
Petersen PE, Baez RJ, Lennon MA: Community-oriented administration of fluoride for the prevention of dental caries: a summary of the current situation in Asia. Adv Dent Res 2012;24:5–10.
Pine CM, Pitts NB, Nugent ZJ: British Association for the Study of Community Dentistry (BASCD) guidance on the statistical aspects of training and calibration of examiners for surveys of child dental health. A BASCD coordinated dental epidemiology programme quality standard. Community Dent Health 1997;14(suppl 1):18–29.
Preisser JS, Stamm JW, Long DL, Kincade ME: Review and recommendations for zero-inflated count regression modeling for dental caries indices in epidemiological studies. Caries Res 2012;13:413–423.
Schwarz G: Estimating the dimension of a model. The Annals of Statistics 1978;6:461–464.
Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD: Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774–781.
Stewart PW, Stamm JW: Classification tree prediction models for dental caries from clinical, microbiological, and interview data. J Dent Res 1991;70:1239–1251.
Todem D: Statistical models for dental caries data; in Ming-Yu L (ed): Contemporary Approach to Dental Caries, 2012. http://www.intechopen.com/books/contemporary-approach-to-dental-caries/statistical-models-for-dental-caries-data.
Zeileis A, Kleiber C, Jackman S: Regression Models for Count Data in R. Journal of Statistical Software 2008;27:1–25.
