Psychological Methods, 2014, Vol. 19, No. 3, 428–443
© 2014 American Psychological Association 1082-989X/14/$12.00 http://dx.doi.org/10.1037/a0036850

How Bandwidth Selection Algorithms Impact Exploratory Data Analysis Using Kernel Density Estimation

Jared K. Harpole and Carol M. Woods, University of Kansas
Thomas L. Rodebaugh and Cheri A. Levinson, Washington University in St. Louis
Eric J. Lenze, Washington University in St. Louis School of Medicine

Abstract

Exploratory data analysis (EDA) can reveal important features of underlying distributions, and these features often have an impact on inferences and conclusions drawn from data. Graphical analysis is central to EDA, and graphical representations of distributions often benefit from smoothing. A viable method of estimating and graphing the underlying density in EDA is kernel density estimation (KDE). This article provides an introduction to KDE and examines alternative methods for specifying the smoothing bandwidth in terms of their ability to recover the true density. We also illustrate the comparison and use of KDE methods with 2 empirical examples. Simulations were carried out in which we compared 8 bandwidth selection methods (Sheather–Jones plug-in [SJDP], normal rule of thumb, Silverman's rule of thumb, least squares cross-validation, biased cross-validation, and 3 adaptive kernel estimators) using 5 true density shapes (standard normal, positively skewed, bimodal, skewed bimodal, and standard lognormal) and 9 sample sizes (15, 25, 50, 75, 100, 250, 500, 1,000, 2,000). Results indicate that, overall, SJDP outperformed all methods. However, for smaller sample sizes (25 to 100) either biased cross-validation or Silverman's rule of thumb was recommended, and for larger sample sizes the adaptive kernel estimator with SJDP was recommended. Information is provided about implementing the recommendations in the R computing language.

Keywords: kernel density estimation, bandwidth selection, exploratory data analysis, graphical analysis, adaptive kernel density estimation

This article was published Online First June 2, 2014. Jared K. Harpole and Carol M. Woods, Department of Psychology, University of Kansas; Thomas L. Rodebaugh and Cheri A. Levinson, Department of Psychology, Washington University in St. Louis; Eric J. Lenze, Department of Psychiatry, Washington University in St. Louis School of Medicine. We thank the Center for Research Methods and Data Analysis, Kim Gibson, and Sunthud Pornprasertmanit for assistance and advice in preparing this article. Declaration of competing interests: none. Correspondence concerning this article should be addressed to Jared K. Harpole, 11300 West 117th Street, Overland Park, KS 66210. E-mail: [email protected]

Exploratory data analysis (EDA) is important, yet often overlooked in the social and behavioral sciences. The quality of statistical conclusions depends on the accuracy of the data used in the analyses; in other words, "garbage in, garbage out" (Kline, 2008). EDA assists with hypothesis testing by revealing unexpected or misleading patterns in the data (Behrens, 1997). Unfortunately, many psychologists do not utilize EDA in their research. Central to EDA are graphics, which researchers use to diagnose potential issues in the data and observe trends that can be hidden by summary statistics. When scientists want to analyze and graph the underlying distribution of data, either parametric or nonparametric methods can be employed. A parametric approach assumes that the random sample comes from a known family of distributions such as the normal, whereas a nonparametric approach makes no assumption about the distribution underlying the random sample. In practice, a nonparametric approach is preferable because the underlying density is typically unknown (Mugdadi & Jetter, 2010). A nonparametric technique commonly used is kernel density estimation (KDE). Kernel density estimation uses local averaging to create a smooth curve from a sample of observations. To estimate the density f̂ at a given point x, the local averaging occurs within a specified neighborhood of points close to x. The closer the points are to x, the more weight they are assigned, which increases the estimated density at x, while points farther from x receive little to no weight (Wilcox, 2001). Although KDE estimates are not frequently reported in social science research, they have great utility as a tool for identifying outliers and unexpected patterns in the data not apparent from summary statistics (Behrens, 1997; Marmolejo-Ramos & Matsunaga, 2009). For example, Wilcox (2004, 2006) showed how KDE could be used to graphically depict effect sizes between groups and how using KDE can assist in finding group differences hidden by measures of central tendency. Recently, Wilcox, Erceg-Hurn, Clark, and Carlson (2013) described a technique of comparing two independent groups via upper and lower quantiles. KDE can be used as a way to help illustrate where the group differences lie in testing upper and/or lower quantiles, as described in Wilcox et al. (2013). Akiskal and Benazzi (2006) used KDE in conjunction with other measures to corroborate their conclusions that major depressive disorder and bipolar II disorder are not distinct but lie on a continuum. Additionally, Osberg and Smeeding (2006) mentioned that the value of KDE lies in presenting a picture that conveys much more information than summary statistics.

There are three issues, however, of which practitioners must be aware when utilizing KDE. First, it is necessary to specify a kernel function to estimate the density. There are several common types of kernel estimators used in practice: normal, Epanechnikov, biweight, triweight, triangular, uniform, and cosine (Scott, 1992). Table 1 lists the equations for each kernel. There is no consensus about what type of kernel is best, but several authors have noted that the choice of the kernel is not particularly important because there is a trivial loss of efficiency in picking one kernel over the other (Scott, 1992; Silverman, 1986; Wand & Jones, 1995). In the present study the Epanechnikov kernel will be used, because it is the most efficient kernel, albeit by a trivial amount (Wand & Jones, 1995, p. 31). The second issue involves selecting the bandwidth or smoothing parameter. There are two commonly used methods of choosing a smoothing parameter in practice: visually and automatically from the data. Visually selecting a bandwidth involves trial and error. This method can be very effective in determining the shape of the underlying distribution but is flawed: a trial and error approach can be time consuming and lacks objectivity (Wand & Jones, 1995). For many density estimation problems in the social and behavioral sciences, visually selecting the bandwidth would not be recommended. Automatic bandwidth selection methods use the data to generate a suitable bandwidth mechanically. The goal of data-driven bandwidth selection is to alleviate the subjectivity of visually selecting a bandwidth and to quickly and accurately select an optimal smoothing parameter. This approach has been researched extensively in mathematical statistics (Bowman & Azzalini, 1997; Cao, Cuevas, & Manteiga, 1994; Devroye, 1997; Jones, Marron, & Sheather, 1996; Park & Marron, 1990; Scott, 1992; Scott & Terrell, 1987; Sheather, 1992; Sheather & Jones, 1991; Silverman, 1986; Wand & Jones, 1995). However, very little research involving the impact of bandwidth selection methods on graphical illustrations exists in the social and behavioral sciences literature. The third issue involves density estimation with bounded variables. Bounded variables occur when observations are censored to some value(s). For example, Wilcox (2004, 2012) used a data set dealing with hangover symptoms after drinking in which negative values were implausible and showed that different KDE methods can produce impossible values. Variables with boundary restrictions can create difficulties for KDE if practitioners are unaware of how to deal with these issues. There are various methods for addressing bounded variables; however, some methods may cause the estimated density to become improper (Silverman, 1986; Simonoff, 1996; Wilcox, 2004). Wilcox (2004, 2012) proposed an adaptive kernel density estimate (AKDE) that has been shown to successfully deal with the boundary restriction problem. AKDE varies the amount of smoothing based on the density of points around a given point x. Further details about this method are provided later. The next section describes KDE and common bandwidth selection algorithms in more detail.

Background on KDE

In KDE it is assumed we are given a sample of n identically and independently distributed (iid) observations X1, X2, . . . , Xn from which a density will be estimated. Let f(x) be the true density and f̂(x; h) be the estimated density. The kernel estimator is

\hat{f}(x; h) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),   (1)

where K is the kernel function that satisfies ∫K(y)dy = 1, ∫yK(y)dy = 0, and ∫y²K(y)dy = μ₂ > 0, and h is the bandwidth or smoothing parameter (Silverman, 1986; Wand & Jones, 1995). The μ₂ symbol depicts the second moment of a probability density function. Note that an unqualified ∫ symbol indicates integration over the entire real line. A more compact way to write Equation 1 is by letting K_h(u) = h⁻¹K(u/h):

\hat{f}(x; h) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i).   (2)

To ensure that f̂(x; h) is a density, the kernel should be a unimodal probability density function that is symmetric about zero (Mugdadi & Jetter, 2010; Scott, 1992; Silverman, 1986; Wand & Jones, 1995). In the present study the Epanechnikov kernel will be chosen for K, defined as

K(y) = \frac{3}{4}\left(1 - \frac{1}{5}y^{2}\right) \Big/ \sqrt{5} \quad \text{for } |y| < \sqrt{5}, \text{ and } 0 \text{ otherwise}.   (3)
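To make Equations 2 and 3 concrete, the following is a minimal R sketch of a fixed-bandwidth kernel estimate computed by hand with the Epanechnikov kernel. The data and the bandwidth value are hypothetical and chosen only for illustration; in practice the built-in density() function performs this computation.

## Epanechnikov kernel of Equation 3: support |y| < sqrt(5), 0 otherwise
epanechnikov <- function(y) ifelse(abs(y) < sqrt(5), 0.75 * (1 - y^2 / 5) / sqrt(5), 0)

## Equation 2: average the scaled kernel K_h(x - X_i) over the sample
kde <- function(x, data, h) sapply(x, function(x0) mean(epanechnikov((x0 - data) / h) / h))

set.seed(1)
X  <- rnorm(50)                        # hypothetical sample of n = 50 observations
xs <- seq(-3, 3, length.out = 200)     # points at which to evaluate the estimate
plot(xs, kde(xs, X, h = 0.5), type = "l", xlab = "x", ylab = "density")  # arbitrary h = .5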

Kernel Density Error Criteria

Suitable error criteria must be evaluated to examine the performance of various bandwidth selection methods. There are multiple error criteria that could be used, such as mean integrated square error (MISE), mean integrated absolute error (MIAE), mean uniform absolute error (MUAE), and mean Hellinger distance (MHD; Cao et al., 1994; Mugdadi & Jetter, 2010; Wand & Jones, 1995). Most studies, however, have analyzed MISE because it is substantially easier to work with (Farmen & Marron, 1999; Jones et al., 1996; Marron & Wand, 1992; Mugdadi & Jetter, 2010; Park & Marron, 1990; Scott & Terrell, 1987; Sheather & Jones, 1991). Additionally, Farmen and Marron (1999) used the MISE to compare several adaptive bandwidth selection methods and one fixed bandwidth selection method.

Table 1
Some Kernels and Their Equations

Kernel         K(t)
Normal         (1/√(2π)) e^(−t²/2)
Uniform        1/2
Epanechnikov   (3/4)(1 − t²)
Biweight       (15/16)(1 − t²)²
Triweight      (35/32)(1 − t²)³
Triangle       1 − |t|
Cosine         (π/4) cos(πt/2)

Note. For the uniform, Epanechnikov, biweight, triweight, and cosine kernels, |t| ≤ 1; K(t) = 0 otherwise.
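As an informal illustration of the point that the kernel choice matters far less than the bandwidth, the following R sketch (hypothetical data) overlays estimates based on several of the kernels in Table 1 while holding the bandwidth fixed; the resulting curves are nearly indistinguishable.

set.seed(1)
x <- rnorm(200)                                       # hypothetical sample
plot(density(x, bw = 0.4, kernel = "gaussian"),       # fixed bandwidth, normal kernel
     main = "Several kernels, same bandwidth")
for (k in c("epanechnikov", "biweight", "triangular", "cosine"))
  lines(density(x, bw = 0.4, kernel = k), lty = 2)    # overlay the alternative kernels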

The present study used the MISE as the error criterion in evaluating bandwidth selection performance. The discrepancy measure between f(x) and f̂(x; h) at a specific point is typically the mean square error (MSE). The MSE can be expressed as

\mathrm{MSE}_x\,\hat{f}(x) = E[\hat{f}(x; h) - f(x)]^2,   (4)

which can be stated in terms of squared bias and variance as

\mathrm{MSE}_x\,\hat{f}(x) = [E\,\hat{f}(x; h) - f(x)]^2 + \mathrm{Var}\,\hat{f}(x; h)   (5)

(Wand & Jones, 1995, p. 14). Because Equation 5 only calculates the discrepancy at a single point, an error criterion is needed over the real line. The integral of the MSE over the real line leads to the MISE, which is a global error criterion. The MISE is given as

\mathrm{MISE}\,\hat{f}(x; h) = E \int [\hat{f}(x; h) - f(x)]^2\,dx,   (6)

which is equivalent to

\mathrm{MISE}\,\hat{f}(x; h) = \int [E\,\hat{f}(x; h) - f(x)]^2\,dx + \int \mathrm{Var}[\hat{f}(x; h)]\,dx.   (7)

After some manipulation (see Jetter, 2005, pp. 6–7; Wand & Jones, 1995, pp. 14–16), the MISE can be expressed as

\mathrm{MISE}\,\hat{f}(x; h) = \frac{1}{nh} \int K^2(x)\,dx + \left(1 - \frac{1}{n}\right) \int (K_h * f)^2(x)\,dx - 2 \int (K_h * f)(x) f(x)\,dx + \int f^2(x)\,dx.   (8)

Note that the * symbol represents the convolution of two functions, such that f * g = ∫f(x − y)g(y)dy. In probability, the convolution of two probability density functions gives the density of the sum of two independent random variables. Throughout this paper convolution notation is used for parsimony. Assessing performance using the MISE is straightforward; however, its dependence on the bandwidth h is complex. To understand the relationship between bandwidth and MISE, we used an asymptotic approximation to the MISE called the asymptotic mean integrated squared error (AMISE). Taking the squared bias and variance from Equation 7, it can be shown by a Taylor series expansion that

\text{Squared bias} = \int \mathrm{bias}_h(x)^2\,dx \approx \frac{1}{4} h^4 \mu_2(K)^2 R(f'')   (9)

and

\text{Variance} = \int \mathrm{Var}[\hat{f}(x)]\,dx \approx n^{-1} h^{-1} R(K),   (10)

where R(K) = ∫K²(y)dy (Wand & Jones, 1995, pp. 19–22). By combining Equations 9 and 10, we obtain the formula for the AMISE:

\mathrm{AMISE} = \frac{1}{4} h^4 \mu_2(K)^2 R(f'') + n^{-1} h^{-1} R(K).   (11)

The trade-off between bias and variance is illustrated by the terms in Equation 11. The first term represents the squared bias, and the second term represents the variance. If the smoothing parameter h is chosen to minimize the bias, then the resulting density will have a large variance, and vice versa. The only parameter that is unknown in Equation 11 is R(f″), which is a measure of the roughness or curvature of the density. The larger R(f″) is, the larger the AMISE is, and vice versa (Sheather, 2004). In the next section, six bandwidth selection algorithms are introduced that will be the focus of this study.

Bandwidth Selection Algorithms

Normal rule of thumb. The normal rule of thumb (NROT) popularized by Silverman (1986) is well known and is implemented in most major software packages. It involves differentiating Equation 11 with respect to h, setting the derivative equal to zero, and solving for h. When this calculation is performed, the resulting equation is

h_{\mathrm{AMISE}} = \left[\frac{R(K)}{\mu_2(K)^2 R(f'')}\right]^{1/5} n^{-1/5}.   (12)

The only unknown value in Equation 12 is R(f″), which must be estimated. NROT estimates R(f″) by using a reference density, with the standard choice taking f to be Φ_{σ²}, the N(0, σ²) density. When a normal kernel is used and this reference density is substituted in Equation 12, the optimal bandwidth obtained is

h_{\mathrm{NROT}} = 1.06\,\sigma\,n^{-1/5}   (13)

(Silverman, 1986, p. 45). The idea behind NROT is that if the random sample is normally distributed, this selection method should accurately predict the optimal bandwidth. However, NROT has been shown to oversmooth densities that are multimodal and/or depart from normality (Cao et al., 1994; Jones et al., 1996; Scott, 1992; Silverman, 1986; Wand & Jones, 1995).

Silverman's rule of thumb. Silverman (1986, pp. 47–48) recommended reducing the 1.06 factor in Equation 13 to .9 to avoid missing bimodality and to cope well with various unimodal densities. Furthermore, he recommended using the smaller of two scale estimates, the sample standard deviation and the sample interquartile range (IQR) divided by 1.34. This adjusted estimate is known as Silverman's (1986) rule of thumb (SROT) and is defined as

h_{\mathrm{SROT}} = .9\,A\,n^{-1/5}, \quad \text{where } A = \min\{\text{sample standard deviation } \hat{\sigma},\ \mathrm{IQR}/1.34\}.   (14)

Silverman (1986, pp. 47–48) conducted small simulation studies with a sample size of 100 and found that SROT performed well on densities with skewness or bimodality. Jones et al. (1996) also tested SROT for sample sizes of 100 and 1,000 on 15 densities and found that this selector had a high degree of bias for densities with many features.

Least squares cross-validation. One of the best known bandwidth selection methods is least squares cross-validation (LSCV), or unbiased cross-validation (UCV). LSCV was first described in the context of density estimation by Rudemo (1982) and Bowman (1984). Given an estimate f̂(x; h) of a density f(x), the integrated square error (ISE) can be expressed as

\mathrm{ISE}(h) = \int (\hat{f}_h(x) - f(x))^2\,dx = \int \hat{f}_h^2(x)\,dx - 2 \int \hat{f}_h(x) f(x)\,dx + \int f(x)^2\,dx,   (15)

where f̂_h(x) = f̂(x; h) (Silverman, 1986). The last term in Equation 15 does not depend on the bandwidth; thus, we ignore this term in the minimization process and focus on the first two terms in Equation 15. The ideal choice of the window width will minimize

Z(\hat{f}) = \int \hat{f}_h^2(x)\,dx - 2 \int \hat{f}_h(x) f(x)\,dx.   (16)

The general idea is to construct an estimate of Z(f̂) from the data and minimize this estimate over the bandwidth h to determine the optimal bandwidth. To estimate ∫f̂_h(x)f(x)dx, let f̂_{−i}(x) be the density estimate computed from all the data points except x_i, which gives

\hat{f}_{-i}(x_i) = (n - 1)^{-1} h^{-1} \sum_{j \neq i} K\left(\frac{x_i - X_j}{h}\right).   (17)

From Equation 17 it can be seen that we are taking the average of h⁻¹ Σ_{j≠i} K((x_i − x_j)/h) over all the points except x_i. This procedure is repeated for all the remaining data points to estimate 2∫f̂_h(x)f(x)dx. By combining Equation 17 with the first term in Equation 16, we obtain the LSCV function to be minimized,

\mathrm{LSCV}(h) = \int \hat{f}_h^2(x)\,dx - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{-i}(x_i).   (18)

Finding the minimum of Equation 18 with respect to h minimizes the MISE (see Wand & Jones, 1995, p. 63). Although LSCV has been extensively used in practice, research and simulation studies showed that LSCV tended to undersmooth densities and had a high degree of sampling variability. For example, Park and Marron (1990) found that LSCV performed poorly on almost all estimated densities compared with other bandwidth selectors due to the high amount of sampling variability present. These results were further validated by Cao et al. (1994) and Jones et al. (1996). Loader (1999), however, criticized these studies censuring the behavior of LSCV, emphasizing that they did not take into account the strengths and weaknesses of LSCV. Loader found in particular that LSCV does well at recovering densities with many features. However, most densities in psychological data probably do not fit into this category.

Biased cross-validation. Scott and Terrell (1987) created another cross-validation selection method to improve upon the shortfalls of LSCV. Known as biased cross-validation (BCV), this method is similar to LSCV and was created to lower the amount of sampling variability that was causing many of the problems mentioned previously. The difference between LSCV and BCV is that BCV is based on the formula for the AMISE (see Equation 11), whereas LSCV is based on the ISE (see Equation 15). With BCV, the unknown quantity R(f″) is replaced by the estimator

\tilde{R}(f'') = R(\hat{f}'') - (nh^5)^{-1} R(K''),   (19)

where f̂ is the estimate f̂(x; h) of Equation 1. Scott and Terrell (1987) show that using R(f̂″) directly produces an estimate that is biased by the amount (nh⁵)⁻¹R(K″), which explains why this term is subtracted in Equation 19. Plugging Equation 19 into Equation 11 gives the BCV formula

\mathrm{BCV}(h) = \frac{1}{4} h^4 \mu_2(K)^2 \tilde{R}(f'') + (nh)^{-1} R(K)   (20)

(see Scott & Terrell, 1987, and Sheather, 2004, for details). According to Wand and Jones (1995), the attraction of BCV is that the asymptotic variance of the bandwidth is considerably lower than that of LSCV. This reduction in variance comes at the cost of BCV's tendency to oversmooth a density. Several simulation studies indicated that BCV typically performs better than LSCV (Cao et al., 1994; Jones et al., 1996; Park & Marron, 1990).

Plug-in methods. Plug-in methods are a popular approach to bandwidth selection in which the unknown quantity R(f″) in Equation 11 is replaced with an estimate. This approach dates back to Woodroofe (1970) and Nadaraya (1974), who laid the theoretical groundwork, yet the issue of selecting an accurate estimate for R(f″) was not addressed until later. Park and Marron (1990) created a plug-in rule that performed better than BCV and LSCV in asymptotic rate of convergence and in a simulation study. Sheather and Jones (1991) further refined the plug-in rule created by Park and Marron (1990), creating the Sheather–Jones plug-in (SJDP), which has performed well in simulation studies, in asymptotic analyses, and on real data sets. To find the SJDP, use Equation 11 (the equation for the AMISE) and estimate R(f″) by R(f̂″_g), where g is a pilot bandwidth that must be estimated. The SJDP approach writes g as a function of h for the estimate R(f̂″),

g(h) = C(K) \left[\frac{R(f'')}{R(f''')}\right]^{1/7} h^{5/7},   (21)

where C(K) is a constant. Then, it is necessary to estimate the unknown higher order functionals of f using kernel density estimates with the NROT method in place of g. The SJDP is given by estimating g(h) and substituting it into Equation 11, which gives

h = \left[\frac{R(K)}{\mu_2(K)^2 R(\hat{f}''_{g(h)})}\right]^{1/5} n^{-1/5}.   (22)

The SJDP is the smoothing parameter that is the solution to Equation 22 (Sheather, 2004). See Sheather and Jones (1991) for additional details. The SJDP has performed excellently in simulation studies and on real data sets (Cao et al., 1994; Jones et al., 1996; Salgado-Ugarte & Perez-Hernandez, 2003; Sheather, 1992; Sheather & Jones, 1991).

Table 2
Equations With Parameters for the Five Densities

Density          Equation
Normal           N(0, 1)
Skewed           (3/5)N(0, 1) + (1/5)N(−1/2, (2/3)²) + (1/5)N(−13/12, (5/9)²)
Bimodal          (1/2)N(−1, (2/3)²) + (1/2)N(1, (2/3)²)
Skewed bimodal   (3/4)N(0, 1) + (1/4)N(3/2, (1/3)²)
Lognormal        LGN(0, 1)

Note. Normal, skewed, bimodal, and skewed bimodal refer to normal mixture densities as specified in the "ks" R package. The format of the equations for the normal mixtures is p₁N(μ₁, σ₁²) + · · · + pₙN(μₙ, σₙ²), where p, μ, and σ² are the mixture proportion, mean, and variance of the nth normal, respectively. LGN is the abbreviation for the lognormal density with a mean of 0 and variance of 1.

Figure 1. True density functions (standard normal, skewed, bimodal, skewed bimodal, and standard lognormal).

Adaptive kernel methods. Up to this point all the bandwidth selection algorithms have had a fixed smoothing parameter over the support of the density estimate. This constraint may be relaxed, which gives way to adaptive kernel density estimates. AKDEs evolved from the view that the degree of smoothing should adapt to the local density of the data: in regions with a high density of points the bandwidth will be low, and vice versa (Silverman, 1986; Wand & Jones, 1995). Estimating a density using an adaptive kernel requires a two-step procedure, with the first step requiring an initial or pilot estimate of the density and the second phase conducting the adaptive estimate. Let f̃(x_i) be the initial estimate of the true density f(x). The local smoothing factor is defined by

\lambda_i = \left\{\frac{\tilde{f}(x_i)}{g}\right\}^{-\alpha},   (23)

and

\log(g) = \frac{1}{n} \sum_{i=1}^{n} \log \tilde{f}(x_i),   (24)

where α is a sensitivity parameter satisfying 0 ≤ α ≤ 1 and g is the geometric mean of the f̃(x_i). On the basis of recommendations from Silverman (1986) and Abramson (1982), α was chosen to be .5 in this study. The AKDE is given by

\tilde{f}(t) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h\lambda_i} K\left\{\frac{t - x_i}{h\lambda_i}\right\},   (25)

where K is taken to be the Epanechnikov kernel, as recommended by Wilcox (2004, 2012). The bandwidth parameter h is taken to be

h = \frac{1.06\,A}{n^{1/5}},   (26)

where

A = \min(s, \mathrm{IQR}/1.34).   (27)

However, other algorithms, such as those previously mentioned in Equations 14, 18, 20, and 22, can be used to calculate h in the second phase. One decision practitioners must make is choosing a pilot estimate f̃(x_i). One choice that has seen promise in practical applications is the expected frequency curve (Wilcox, 2004, 2012). The expected frequency curve is a slight variation of the naive kernel density estimator, as discussed in Silverman (1986). The general idea when estimating f(x) for a given value x is to use the values among X1, X2, . . . , Xn that are in a specified neighborhood around x. The first step is to compute the median absolute deviation (MAD) statistic, which is the sample median of the n values |X1 − M|, . . . , |Xn − M|, where M is the sample median. A value Xi is said to be in the neighborhood of x if |x − Xi| ≤ h MADN, with h being the bandwidth or smoothing parameter and MADN = MAD/.6745, as described in Wilcox (2004, pp. 337–338; Wilcox, 2012, pp. 47–48). Further, Wilcox (2004, 2012) recommends choosing h to be .80 for general use. The value of .80 for h was used in the current study.

Figure 2. Boxplots for standard normal densities. The diamond in each box refers to the mean, and the line refers to the median of ln(MISE) for each method in a given sample size. LN = natural log; MISE = mean integrated square error; SJ = Sheather–Jones plug-in; NR = normal rule of thumb; SR = Silverman's rule of thumb; LC = least squares cross-validation; BC = biased cross-validation; AK = adaptive kernel; AS = adaptive kernel with Silverman's rule of thumb; AP = adaptive kernel with Sheather–Jones plug-in.

The estimate of the expected frequency curve probability density function at x is given by

\hat{f}(x) = \frac{N_x}{2hn\,\mathrm{MADN}},

where N_x is the number of observations in the neighborhood of x. The AKDE method of Wilcox (2004, 2012) was found to perform well in giving a reasonably good idea of the underlying density. Further, this method was shown to improve upon the issue of bounded variables that Wilcox (2004, 2012) and Bowman and Azzalini (1997) brought to light. Additionally, this method is ideally suited for dealing with skewed variables with a long tail, whereas fixed bandwidth methods may show spurious bumps (Silverman, 1986; Simonoff, 1996; Wand & Jones, 1995). However, we found no studies comparing the performance of the AKDE of Wilcox (2004, 2012) with respect to MISE.
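The following is a minimal R sketch of the adaptive estimate in Equations 23 through 25. To keep the example self-contained it uses an ordinary fixed-bandwidth estimate as the pilot f̃ rather than Wilcox's expected frequency curve, together with the bandwidth of Equations 26 and 27; the akerd function in the WRS package implements the full procedure described above.

epan <- function(y) ifelse(abs(y) < sqrt(5), 0.75 * (1 - y^2 / 5) / sqrt(5), 0)

akde <- function(t, data, h, alpha = 0.5) {
  pilot  <- approx(density(data, bw = h), xout = data)$y  # pilot density at each X_i (assumption: fixed-bandwidth pilot)
  g      <- exp(mean(log(pilot)))                         # geometric mean of the pilot values (Equation 24)
  lambda <- (pilot / g)^(-alpha)                          # local smoothing factors (Equation 23)
  sapply(t, function(t0)                                  # adaptive estimate (Equation 25)
    mean(epan((t0 - data) / (h * lambda)) / (h * lambda)))
}

set.seed(1)
x  <- rlnorm(250)                                   # hypothetical skewed, bounded sample
h  <- 1.06 * min(sd(x), IQR(x) / 1.34) / 250^(1/5)  # Equations 26 and 27
ts <- seq(0, 8, length.out = 200)
plot(ts, akde(ts, x, h), type = "l", xlab = "x", ylab = "density")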

The Current Study

The goal of KDE as an EDA approach is to estimate the true underlying density as accurately as possible. Different bandwidth selection algorithms lead to density estimates with varying levels of accuracy. The present study compares eight bandwidth selection methods: the six methods reviewed above and the AKDE with the second-phase h determined by Equations 14 and 22. All bandwidth selection methods were assessed with the MISE to measure density recovery, with sample size and true density shape as independent variables. The three AKDE methods have not been studied for MISE performance previously. The study also illustrates the use of KDE with comparisons of the bandwidth selection methods and their use on two empirical examples. Prior simulations involving more than three bandwidths with multiple densities have primarily focused on sample sizes of 100 or more (Cao et al., 1994; Devroye, 1997; Jones et al., 1996). Here, nine different sample sizes will be used, including four that are less than 100, because smaller samples are common in the social sciences. In addition, previous studies have considered a strongly positively skewed density similar to a lognormal distribution; however, these studies have not considered a moderately positively skewed density (Cao et al., 1994; Jones et al., 1996; Marron & Wand, 1992). Here, moderately positively skewed densities and lognormal densities are included among the true density shapes examined.

Figure 3. Boxplots for skewed density. The diamond in each box refers to the mean, and the line refers to the median of ln(MISE) for each method in a given sample size. LN = natural log; MISE = mean integrated square error; SJ = Sheather–Jones plug-in; NR = normal rule of thumb; SR = Silverman's rule of thumb; LC = least squares cross-validation; BC = biased cross-validation; AK = adaptive kernel; AS = adaptive kernel with Silverman's rule of thumb; AP = adaptive kernel with Sheather–Jones plug-in.

Method


Simulation Method

An R program (Version 3.0.1) generated the data, executed the simulations, and processed the output (R Core Team, 2013). The R code is freely available by request from the first author. The R program utilized the ks and stats packages to randomly generate data from a standard lognormal and four normal mixture densities: a standard normal, bimodal, positively skewed, and skewed bimodal (Duong, 2013). The equations for the densities are given in Table 2, and graphical illustrations are given in Figure 1. These densities were chosen to give a general representation of the types of densities most often encountered in the social and behavioral sciences. For example, psychologists may study the number of drinks by college students or the number of sexual partners. These outcomes tend to be highly skewed, bounded, and reasonably characterized as lognormal (Schmitt, 2003; Wilcox, 2004, 2012). The positively skewed density is representative of psychopathologies such as anxiety and depression (Carleton, Asmundson, & Taylor, 2005; Van Dam & Earleywine, 2011). Both the bimodal and asymmetric bimodal densities can occur in stereotype research as well as intergroup relations (Fiske, Cuddy, Glick, & Xu, 2002; Van Boven & Thompson, 2003).

To calculate the density for the fixed bandwidth selection methods (NROT, SROT, LSCV, BCV, and SJDP), we used the density function within the "stats" package of R. For the AKDEs, the akerd function from the "WRS" R package was used (Wilcox & Schönbrodt, 2013). To estimate the three AKDEs, we used the defaults in the akerd function and two modifications. The first AKDE was the default in the akerd function as described in section 3.06 and denoted AK. The other two AKDE estimates involved changing the second-phase bandwidth selection method from the default of AK to the SROT and the SJDP, denoted AKSR and AKSJ, respectively. These two methods were chosen as additional comparisons because they demonstrated the best performance of the fixed bandwidth methods in pilot studies conducted by the authors. The Epanechnikov kernel was used throughout all simulations.

Figure 4. Boxplots for bimodal density. The diamond in each box refers to the mean, and the line refers to the median of ln(MISE) for each method in a given sample size. LN = natural log; MISE = mean integrated square error; SJ = Sheather–Jones plug-in; NR = normal rule of thumb; SR = Silverman's rule of thumb; LC = least squares cross-validation; BC = biased cross-validation; AK = adaptive kernel; AS = adaptive kernel with Silverman's rule of thumb; AP = adaptive kernel with Sheather–Jones plug-in.

For each density and bandwidth selection method, nine sample sizes were evaluated: 15, 25, 50, 75, 100, 250, 500, 1,000, and 2,000. Sample sizes were chosen to cover a range typically encountered in psychology and to show how the bandwidth selectors behave with a large sample. The outcome of interest was the MISE, calculated by summing (f(x) − f̂(x))² over a grid of points. There were two different grids in this study, one for the unbounded normal mixture densities and one for the bounded lognormal density. A pilot simulation was carried out to determine the upper and lower bounds for both grids.

For all five densities, a sample of 100,000 random draws and the 99.99% confidence interval (CI) were obtained; this was repeated 1,000 times, and the minimum lower bound and maximum upper bound of the CIs were computed. For the four normal mixture densities all minimum lower and maximum upper bounds for the 99.99% CIs were between −4.64 and 4.64; thus, the grid bounds were −5 to 5 for all four densities. The lognormal density 99.99% CI minimum and maximum values were .01 and 76.86, respectively. The grid bounds for the lognormal density were chosen to be 0 and 77. The number of grid points for all five densities was 512. This is the default value in the density function in R, and we had reason to believe that the MISE was not importantly influenced by the number of grid points based on a sensitivity analysis. For the lognormal distribution conditions, the mean MISE for the 512 grid points was compared to the mean MISE for 7,700 grid points chosen such that there were 100 grid points between each subsequent integer grid point. The results indicated that the 97.50 quantiles of MISE differences were the same to four decimal points across all conditions.

Figure 5. Boxplots for skewed bimodal density. The diamond in each box refers to the mean, and the line refers to the median of ln(MISE). LN = natural log; MISE = mean integrated square error; SJ = Sheather–Jones plug-in; NR = normal rule of thumb; SR = Silverman's rule of thumb; LC = least squares cross-validation; BC = biased cross-validation; AK = adaptive kernel; AS = adaptive kernel with Silverman's rule of thumb; AP = adaptive kernel with Sheather–Jones plug-in.

To provide more interpretable results, the natural log of the MISE was taken, with lower values of ln(MISE) indicating better performance. To provide a picture of how the different bandwidth selection methods behave, we constructed five boxplots with the nine sample sizes and the eight bandwidth selection methods on each.
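As a hedged sketch of this outcome computation (not the authors' code), the ISE for one replication can be approximated by evaluating the estimate and the true density on the same grid and accumulating the squared differences; averaging over replications and taking the natural log gives ln(MISE). The example below uses the standard normal condition, the Sheather–Jones bandwidth, and the grid described above.

grid_ise <- function(n, from = -5, to = 5, n_grid = 512) {
  x   <- rnorm(n)                                        # one sample from the true (standard normal) density
  est <- density(x, bw = "SJ", kernel = "epanechnikov",
                 from = from, to = to, n = n_grid)       # KDE evaluated on the fixed grid
  truth <- dnorm(est$x)                                  # true density at the same grid points
  sum((est$y - truth)^2) * (est$x[2] - est$x[1])         # Riemann approximation of the ISE
}

set.seed(1)
mise_hat <- mean(replicate(1000, grid_ise(n = 100)))     # Monte Carlo estimate of the MISE
log(mise_hat)                                            # ln(MISE); lower values are better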


Results

Boxplots of the mean (diamond) and median (line) ln(MISE), averaged over the 1,000 replications, are presented in Figures 2–6 for each true density. The abbreviations for each bandwidth selection method in the graphs are as follows: SJDP = SJ, NROT = NR, SROT = SR, LSCV = LC, BCV = BC, AK = AK, AKSR = AS, AKSJ = AP. Figures 2–6 are organized such that each of the nine cells represents a given sample size for the specific density studied.


In Figure 2, the standard normal density, BC performed better than all other methods in terms of lowest mean MISE for sample sizes below 250. For sample sizes of 250 and above, AK performed best. In Figure 3, the skewed normal density, BC had the best performance for sample sizes of 15, 25, and 50. For sample sizes of 75 to 2,000, the AK method performed best. In Figure 4, the bimodal density, SR performed best for sample sizes of 75 and less, AS performed best for a sample size of 100, and AP performed the best for sample sizes of 250 and greater. In Figure 5, the skewed bimodal density, BC performed best for the 15 sample size condition and SR performed best for sample sizes of 25, 50, 75, and 100. For sample sizes of 250 to 500 AP performed the best, and for sample sizes of 1,000 and 2,000 SJDP performed best. In Figure 6, the standard lognormal density, AP performed the best in all sample size conditions except 75, in which LSCV performed the best. For the larger sample sizes AP was a clear winner, as indicated by the lowest ln(MISE).

Figure 6. Boxplots for lognormal density. The diamond in each box refers to the mean, and the line refers to the median of ln(MISE) for each method in a given sample size. LN = natural log; MISE = mean integrated square error; SJ = Sheather–Jones plug-in; NR = normal rule of thumb; SR = Silverman's rule of thumb; LC = least squares cross-validation; BC = biased cross-validation; AK = adaptive kernel; AS = adaptive kernel with Silverman's rule of thumb; AP = adaptive kernel with Sheather–Jones plug-in.


To help summarize and synthesize the results and provide practical conclusions, we devised a ranking system based on Cao et al. (1994). Although the ranking strategy was arbitrary, it was designed to provide practitioners with a general idea of the relative performance of the bandwidth selection methods given the conditions of the present study. Each bandwidth selection method was given one total score reflecting how close it was on average to the theoretical MISE (bias), how variable the bandwidth was over replications (VAR), and the interaction between the bias and VAR over replications (MSE). The ideal bandwidth selection method should have a low bias, low variance, and low MSE. For each outcome, the best bandwidth selection method for a given sample size and true density was given eight points for the lowest outcome (best performance), seven points for the second lowest, and so on to one point for the highest outcome (worst performance). A higher number of points corresponds to better performance. The aggregated results are as follows (scores are in parentheses): SJDP (727), SROT (678), BCV (653), AKSJ (647), NROT (620), AKSR (592), AK (580), and LSCV (363). As can be seen, the overall winner is the SJDP, followed by SROT, BCV, and AKSJ. However, these results do not tell the whole story because they are aggregated over sample size. The rankings are broken down by sample size in Tables 3, 4, and 5. One result that is consistent between both the aggregated results and the results in Tables 3–5 is that LSCV is last in almost every category. Tables 3–5 better convey how the optimal method depends on sample size. Looking at sample sizes of 15 to 100, BCV was the best in the smallest sample sizes (15 and 25) and SROT was the best method for sample sizes 50, 75, and 100. Once the sample size increases to 250, AKSJ surpasses SROT and performs best for sample sizes of 250 to 2,000. A final interesting result is that the adaptive kernel methods (AK, AKSR, AKSJ) all perform below the fixed bandwidth methods when the sample sizes are small (100 and under). For sample sizes of 250 and 500 the three adaptive methods take the top three rankings, albeit SJDP is tied with AK for 250 and slightly behind for 500. For sample sizes of 1,000 and 2,000 AKSR maintains the second spot and is passed by SJDP, respectively. When sample sizes are 25 to 100, BCV closely tails SJDP, and at a sample size of 250 it is several points behind both SROT and NROT. At a sample size of 500, BCV ties with NROT, and at 1,000 and 2,000 BCV surpasses both SROT and NROT.

Table 3
Bandwidth Selection Rankings by Sample Size

                Sample size
        15              25              50
Method  Total   Method  Total   Method  Total
BCV     97      BCV     92      SROT    88
NROT    94      NROT    87      SJDP    85
SROT    87      SJDP    87      NROT    84
SJDP    83      SROT    82      BCV     81
AK      61      AK      59      AKSR    59
AKSR    46      AKSR    50      AK      53
AKSJ    39      AKSJ    42      AKSJ    50
LSCV    33      LSCV    41      LSCV    40

Note. Methods are presented in ranked order from best to worst. BCV = biased cross-validation; NROT = normal rule of thumb; SROT = Silverman's rule of thumb; SJDP = Sheather–Jones plug-in; AK = adaptive kernel; AKSR = adaptive kernel with Silverman's rule of thumb; AKSJ = adaptive kernel with Sheather–Jones plug-in; LSCV = least squares cross-validation.

Table 4
Bandwidth Selection Rankings by Sample Size

                Sample size
        75              100             250
Method  Total   Method  Total   Method  Total
SROT    86      SROT    88      AKSJ    85
SJDP    84      SJDP    82      AKSR    74
NROT    83      NROT    78      AK      73
BCV     72      BCV     64      SJDP    73
AKSR    60      AKSR    63      SROT    72
AK      57      AK      61      NROT    65
AKSJ    57      AKSJ    61      BCV     52
LSCV    41      LSCV    43      LSCV    46

Note. Methods are presented in ranked order from best to worst. SROT = Silverman's rule of thumb; SJDP = Sheather–Jones plug-in; NROT = normal rule of thumb; BCV = biased cross-validation; AKSR = adaptive kernel with Silverman's rule of thumb; AK = adaptive kernel; AKSJ = adaptive kernel with Sheather–Jones plug-in; LSCV = least squares cross-validation.

Table 5
Bandwidth Selection Rankings by Sample Size

                Sample size
        500             1,000           2,000
Method  Total   Method  Total   Method  Total
AKSJ    100     AKSJ    106     AKSJ    107
AKSR    78      AKSR    81      SJDP    84
AK      73      SJDP    77      AKSR    81
SJDP    72      AK      72      BCV     74
SROT    64      BCV     66      AK      71
BCV     55      SROT    59      SROT    52
NROT    53      NROT    43      LSCV    38
LSCV    45      LSCV    36      NROT    33

Note. Methods are presented in ranked order from best to worst. AKSJ = adaptive kernel with Sheather–Jones plug-in; AKSR = adaptive kernel with Silverman's rule of thumb; AK = adaptive kernel; SJDP = Sheather–Jones plug-in; SROT = Silverman's rule of thumb; BCV = biased cross-validation; NROT = normal rule of thumb; LSCV = least squares cross-validation.

Empirical Example: Clinical

An empirical data set was analyzed to illustrate KDE and the bandwidth selection algorithms in small sample size scenarios. The data set comes from a study by Rodebaugh, Levinson, and Lenze (2013), comparing drug and placebo groups on a clinical assay for two outcome measures in a pretest/posttest design. The outcome variables were participants' scores on the Subjective Units of Distress Scale (SUDS; Wolpe, 1988) and the Social Phobia Scale (SPS; Mattick & Clark, 1998). The study consisted of 30 participants with social anxiety disorder as determined by clinical interview combined with a Liebowitz Social Anxiety Scale (Liebowitz, 1987) score of 30 or more, which is a reliable cut score for social anxiety disorder (Mennin et al., 2002).

There were 16 participants in the drug group and 14 participants in the placebo group. Sixty-seven percent of participants were women, and the average age was 43 years (range = 19 to 64). An overall multivariate analysis of covariance demonstrated a statistically significant treatment effect in this sample (Rodebaugh et al., 2013). However, in comparing the raw difference scores on the SUDS and SPS for the drug and placebo groups during follow-up tests, Rodebaugh et al. (2013) found an ambiguous result. They determined that the difference scores between drug and placebo for SPS were larger than the difference scores for SUDS. This result was nebulous, because the treatment should be affecting the ratings relating to speech anxiety more than those relating to general scrutiny. KDE is used here to better understand the behavior of the difference scores for SPS and SUDS.

Figures 7 and 8 present the results of using the eight bandwidth selection methods mentioned previously for the SPS and SUDS, respectively. Given the small sample sizes (drug group = 16 and placebo group = 14), special attention was given to the BCV and NROT because these methods performed best with smaller samples in the simulation study (see Table 3). As can be seen from Figure 7 (SPS), the drug group's distribution is negatively skewed. All of the fixed bandwidth methods provide roughly the same result, but BCV and NROT both give a clear indication that the drug group's distribution is only skewed to the right, with no extra bumps in the left tail. The adaptive estimates are similar but do not quite capture the full density. This issue with AKDEs was raised by Wilcox (2012) when dealing with small sample sizes. In Figure 8 (SUDS) all methods indicate that the placebo group has a negatively skewed density. Of all the methods, the BCV gives the clearest indication of the negative skewness. The NROT also shows more negative skewness than the other fixed bandwidth methods (SJDP, SROT, and LSCV), albeit not as distinct as BCV. Finally, notice the vast difference between the two sets of densities estimated using the LSCV (Figure 7 vs. Figure 8). This difference illustrates that although the LSCV may have low bias, it has high variance and is not recommended here.

Because the treatment distribution is highly skewed, traditional parametric statistics are likely to be misleading. Two t tests showed that both SUDS and SPS difference scores were not statistically significant, albeit SUDS differences were less than SPS. Given the nonnormality of the drug distribution, it is prudent to use a robust estimator to test the differences for SUDS and SPS. The pb2gen function in the WRS R package was used to test the differences using M-estimators, which are robust to skewed distributions (see Wilcox, 2012, pp. 173–174, for details). The result indicated that SPS difference scores between drug and placebo were not significant, but SUDS difference scores were significant. These results are consistent with theory and were determined by looking at the two distributions of both groups and using an appropriate statistical method for the problem at hand.
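A hedged sketch of how fixed-bandwidth estimates like those compared in Figures 7 and 8 can be produced with base R follows. The difference-score vectors are hypothetical stand-ins for the clinical data; bw.nrd0 gives Silverman's .9 rule of thumb, bw.nrd a closely related 1.06-factor rule, bw.ucv and bw.bcv the two cross-validation selectors, and bw.SJ the Sheather–Jones bandwidth. With real data, the adaptive estimates shown in the figures would come from akerd() in the WRS package, as in the simulations.

set.seed(1)
drug    <- -rlnorm(16)                          # hypothetical negatively skewed difference scores
placebo <- rnorm(14)                            # hypothetical placebo difference scores

## fixed bandwidths for the drug group under several selectors
round(c(SROT = bw.nrd0(drug), NROT = bw.nrd(drug),
        LSCV = bw.ucv(drug), BCV = bw.bcv(drug),
        SJDP = bw.SJ(drug, method = "dpi")), 3) # direct plug-in variant of Sheather-Jones

## overlay the BCV-based estimates for the two groups
plot(density(drug, bw = "bcv", kernel = "epanechnikov"), main = "BCV estimates by group")
lines(density(placebo, bw = "bcv", kernel = "epanechnikov"), lty = 2)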

Figure 7. Bandwidth selection methods for raw difference scores on the Social Phobia Scale (SPS). NROT = normal rule of thumb; SROT = Silverman's rule of thumb; SJDP = Sheather–Jones plug-in; BCV = biased cross-validation; LSCV = least squares cross-validation; AK = adaptive kernel; AKSR = adaptive kernel with Silverman's rule of thumb; AKSJ = adaptive kernel with Sheather–Jones plug-in.

Figure 8. Bandwidth selection methods for raw difference scores on the Subjective Units of Distress Scale (SUDS). NROT = normal rule of thumb; SROT = Silverman's rule of thumb; SJDP = Sheather–Jones plug-in; BCV = biased cross-validation; LSCV = least squares cross-validation; AK = adaptive kernel; AKSR = adaptive kernel with Silverman's rule of thumb; AKSJ = adaptive kernel with Sheather–Jones plug-in.

Empirical Example: Sexual Partners

The second empirical example deals with two questions from the 2012 U.S. General Social Survey (GSS; Smith, Marsden, & Hout, 2013) regarding sexual partners and illustrates how KDE and bandwidth selection methods perform on a larger data set. The research question involves the differences in reported number of heterosexual partners for men and women. For men, the question on the GSS was "thinking about the time since your 18th birthday (again, including the recent past that you have already told us about) how many female partners have you ever had sex with?" For women, the question read the same as for men but "females" replaced "males." The total sample size analyzed was 1,974 people; 386 people were removed for missing or ambiguous responses on the items, 101 were removed for missing or ambiguous responses on age, and one outlier was removed because that individual indicated having greater than 989 sexual partners. The total sample size of the revised GSS data was 1,486 persons, with 55.60% women, and the mean age was 45.89 years (range = 18 to 88). These questions are obviously bounded on the low end at 0. The empirical question raised here is whether men and women differ in their responses. Using the pb2gen function we find there is a difference between men and women, with a 95% CI of −4.13 to −2.53, p < .001. This tells us that men have significantly more sexual partners on average than women. However, this statistic does not tell the entire story. KDE was used here to understand what the relationship looks like.

Figure 9 illustrates the KDE of the group differences with all eight bandwidth selection methods mentioned previously. To better illustrate the most interesting features of the distributions, the range for the number of sexual partners in the plots was restricted to 100 or less. Results from the AKSJ method are favored because this method performed best with large samples in the simulation study. All eight bandwidth selection methods illustrate that the underlying distribution of male sexual partners has a much thicker right tail than that for females. However, AKSJ does the best job of recapturing the underlying density, because numbers of sexual partners between 10 and 30 occurred more frequently than other numbers of partners (explaining some of the bumps). Furthermore, AKSJ shows that the density of males is higher than the density of females close to zero. Other methods except SJDP and LSCV do not show this phenomenon, which is important because the number of sexual partners reported for males and females does not significantly differ at the lower quantiles of the distribution (below the 25th quantile). However, SJDP and LSCV tended to have noisy right tails in the distribution, probably because these methods have a fixed bandwidth that does not adapt to local features of the data.
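In the spirit of Figure 9, the following minimal sketch overlays two groups' estimates for a bounded, long-tailed count variable. The data are hypothetical, the grid is anchored at zero, and the plotted range is restricted as described above; the direct plug-in variant of the Sheather–Jones bandwidth is used because it tends to be the more numerically stable option in base R for heavily skewed samples.

set.seed(1)
male   <- rlnorm(800, meanlog = 2.0)    # hypothetical partner counts, bounded below at 0
female <- rlnorm(700, meanlog = 1.6)

dm <- density(male,   bw = "SJ-dpi", kernel = "epanechnikov", from = 0)
df <- density(female, bw = "SJ-dpi", kernel = "epanechnikov", from = 0)

plot(dm, xlim = c(0, 100), main = "Hypothetical partner counts by sex")
lines(df, lty = 2)                      # dashed curve for the second group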

Figure 9. Bandwidth selection methods for number of sexual partners. NROT = normal rule of thumb; SROT = Silverman's rule of thumb; SJDP = Sheather–Jones plug-in; BCV = biased cross-validation; LSCV = least squares cross-validation; AK = adaptive kernel; AKSR = adaptive kernel with Silverman's rule of thumb; AKSJ = adaptive kernel with Sheather–Jones plug-in.

Discussion

This study compared the performance of eight bandwidth selection methods with respect to density recovery using the MISE, for varying sample sizes and true density shapes. Simulations showed that, overall, the SJDP bandwidth selector performed best on the three outcome criteria (bias, variance, and MSE) for the five densities considered in this study. This result is consistent with previous research (Cao et al., 1994; Jones et al., 1996; Mugdadi & Jetter, 2010). Hazelton (2003) found that SJDP performed better than two adaptive bandwidth selection algorithms not considered here, across different sample sizes and some similar distributions. Also consistent with previous research was the result that LSCV performed poorly and is not recommended for general use (Cao et al., 1994; Jones et al., 1996; Scott & Terrell, 1987). When sample size is taken into account, the AKSJ method performed better than the other methods for sample sizes of 250 and above in the current study. Additionally, the AKSJ performed better than every other method in the lognormal condition. For smaller sample sizes (25 to 100) the adaptive kernel methods did not perform as favorably as the fixed bandwidth methods. This is likely due to the increased complexity of adaptive bandwidth algorithms, which need larger sample sizes to show better performance.

The performance of the bandwidth selection methods varied according to sample size. First, for sample sizes at or below 100, SROT was best for sample sizes of 100, 75, and 50, and BCV performed best for sample sizes of 25 and 15. The finding that SROT and BCV are superior at lower sample sizes is interesting because it shows that, for these five densities, SROT and BCV would be the recommended methods depending on the size of the sample, which has not been found before. At some point between a sample size of 100 and 250, AKSJ usurps SROT and becomes the best performer. It can be seen that AKSJ outperforms the other methods by a sizable amount at larger sample sizes. However, AKSJ did not perform very well at smaller sample sizes, which explains why SJDP was the best performer overall. SJDP did not perform the best in any one sample size category but was close to the top in many conditions, explaining why it was recommended best overall. This finding is consistent with past research (Cao et al., 1994; Jones et al., 1996).

As a general recommendation to practitioners, SJDP appears to be an excellent all-around choice for selecting the appropriate bandwidth when estimating the underlying density. However, if you are dealing with small sample sizes, we recommend using BCV and/or SROT alone or, even better, in conjunction with SJDP. For larger sample sizes, AKSJ would be the recommended bandwidth alone or in conjunction with SJDP.

For dealing with bounded variables in general with fixed bandwidth selection algorithms in a program such as R or SAS, it is important to specify the boundary over which the software disperses the grid of points used to calculate the density. Within R, it is necessary to specify the "from = min(data)" argument in the density function in order to tell R not to disperse the grid of points beyond the minimum value of the data set under consideration (see the sketch at the end of this section). If this is not specified, the density will return illegal values, as shown in Wilcox (2004). This is consistent with the akerd function, which computes the density starting from the minimum value. Overall, we recommend using the AKSJ bandwidth selection method for bounded variables. Additionally, AKSJ allowed the depiction of a potential lack of difference between male and female sexual partners below the 25th quantile of the distribution. This illustration can help inform practitioners which quantiles they may need to test in the method described by Wilcox et al. (2013).

The simulation results should be qualified by several limitations. First, results are specific to the five true densities. Although these densities were chosen to be representative of those frequently observed in psychological research, the underlying density of a random sample is typically unknown, and the present results may not generalize to all other density shapes that may be observed in practice. Second, although the choice of the kernel is trivial, results are specific to the Epanechnikov kernel. It is unknown how these results would differ if other kernels were used. Future research in this area could look at how other adaptive KDE algorithms perform on the five true densities used here and whether they outperform the methods considered in the current study. For instance, Hazelton (2003) created an adaptive bandwidth selection method using cubic spline interpolation that could be compared to the methods in this study. It would also be interesting to compare the performance of the eight bandwidth selection methods with additional true density shapes in the future.
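Returning to the boundary issue discussed above, the following minimal sketch (hypothetical data) contrasts the default grid with one anchored at the sample minimum via the from argument of density().

set.seed(1)
drinks <- rlnorm(200)                                # hypothetical variable bounded below at 0

unbounded <- density(drinks, bw = "SJ-dpi", kernel = "epanechnikov")
bounded   <- density(drinks, bw = "SJ-dpi", kernel = "epanechnikov",
                     from = min(drinks))             # grid starts at the sample minimum

range(unbounded$x)   # extends below zero, implying impossible negative values
range(bounded$x)     # respects the lower bound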


References

Abramson, I. S. (1982). On bandwidth variation in kernel density estimates—a square root law. Annals of Statistics, 10, 1217–1223. doi:10.1214/aos/1176345986
Akiskal, H. S., & Benazzi, F. (2006). The DSM–IV and ICD-10 categories of recurrent [major] depressive and bipolar II disorders: Evidence that they lie on a dimensional spectrum. Journal of Affective Disorders, 92, 45–54. doi:10.1016/j.jad.2005.12.035
Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131–160. doi:10.1037/1082-989X.2.2.131
Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71, 353–360. doi:10.1093/biomet/71.2.353
Bowman, A. W., & Azzalini, A. (1997). Applied smoothing techniques for data analysis: The kernel approach with S-Plus illustrations. New York, NY: Oxford University Press.
Cao, R., Cuevas, A., & Manteiga, W. G. (1994). A comparative study of several smoothing methods in density estimation. Computational Statistics & Data Analysis, 17, 153–176. doi:10.1016/0167-9473(92)00066-Z
Carleton, R. N., Asmundson, G. J. G., & Taylor, S. (2005). Fear of physical harm: Factor structure and psychometric properties of the Injury/Illness Sensitivity Index. Journal of Psychopathology and Behavioral Assessment, 27, 235–241. doi:10.1007/s10862-005-2403-y
Devroye, L. (1997). Universal smoothing factor selection in density estimation: Theory and practice. Test, 6, 223–320. doi:10.1007/BF02564701
Duong, T. (2013). ks: Kernel smoothing (R package Version 1.8.13) [Computer software]. Retrieved from http://CRAN.R-project.org/package=ks
Farmen, M., & Marron, J. S. (1999). An assessment of finite sample performance of adaptive methods in density estimation. Computational Statistics & Data Analysis, 30, 143–168. doi:10.1016/S0167-9473(98)00070-X
Fiske, S. T., Cuddy, A. J. C., Glick, P., & Xu, J. (2002). A model of (often mixed) stereotype content: Competence and warmth respectively follow from perceived status and competition. Journal of Personality and Social Psychology, 82, 878–902. doi:10.1037/0022-3514.82.6.878
Hazelton, M. L. (2003). Variable kernel density estimation. Australian & New Zealand Journal of Statistics, 45, 271–284. doi:10.1111/1467-842X.00283
Jetter, J. (2005). Comparison between the exact and the approximate bandwidths in kernel density estimation (Master's thesis). Available from ProQuest Dissertations and Theses database.
Jones, M. C., Marron, J. S., & Sheather, S. J. (1996). Progress in data-based bandwidth selection for kernel density estimation. Computational Statistics, 11, 337–381.
Kline, R. B. (2008). Becoming a behavioral science researcher: A guide to producing research that matters. New York, NY: Guilford Press.
Liebowitz, M. R. (1987). Social phobia. Modern Problems of Pharmacopsychiatry, 22, 141–173.
Loader, C. R. (1999). Bandwidth selection: Classic or plug-in? Annals of Statistics, 27, 415–438. doi:10.1214/aos/1018031201
Marmolejo-Ramos, F., & Matsunaga, M. (2009). Getting the most from your curves: Exploring and reporting data using informative graphical techniques. Tutorials in Quantitative Methods for Psychology, 5, 40–50.
Marron, J. S., & Wand, M. P. (1992). Exact mean integrated square error. Annals of Statistics, 20, 712–736. doi:10.1214/aos/1176348653
Mattick, R. P., & Clark, J. C. (1998). Development and validation of measures of social phobia scrutiny fear and social interaction anxiety. Behaviour Research and Therapy, 36, 455–470. doi:10.1016/S0005-7967(97)10031-6
Mennin, D. S., Fresco, D. M., Heimberg, R. G., Schneier, F. R., Davies, S. O., & Liebowitz, M. R. (2002). Screening for social anxiety disorder in the clinical setting: Using the Liebowitz Social Anxiety Scale. Journal of Anxiety Disorders, 16, 661–673. doi:10.1016/S0887-6185(02)00134-2
Mugdadi, A.-R., & Jetter, J. (2010). A simulation study for the bandwidth selection in the kernel density estimation based on the exact and the asymptotic MISE. Pakistan Journal of Statistics, 26, 239–265.
Nadaraya, E. A. (1974). On the integral mean square error of some nonparametric estimates for the density function. Theory of Probability and Its Applications, 19, 133–141. doi:10.1137/1119010
Osberg, L., & Smeeding, T. (2006). "Fair" inequality? Attitudes toward pay differentials: The United States in comparative perspective. American Sociological Review, 71, 450–473. doi:10.1177/000312240607100305
Park, B. U., & Marron, J. S. (1990). Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85, 66–72. doi:10.1080/01621459.1990.10475307
R Core Team. (2013). R: A language and environment for statistical computing. Available from http://www.R-project.org/
Rodebaugh, T. L., Levinson, C. A., & Lenze, E. J. (2013). A high-throughput clinical assay for testing drug facilitation of exposure therapy. Depression and Anxiety, 30, 631–637. doi:10.1002/da.22047
Rudemo, M. (1982). Empirical choice of histogram and kernel density estimators. Scandinavian Journal of Statistics, 9, 65–78.
Salgado-Ugarte, I. H., & Perez-Hernandez, M. A. (2003). Exploring the use of variable bandwidth kernel density estimators. Stata Journal, 3, 133–147.
Schmitt, D. P. (2003). Universal sex differences in the desire for sexual variety: Tests from 52 nations, 6 continents, and 13 islands. Journal of Personality and Social Psychology, 85, 85–104. doi:10.1037/0022-3514.85.1.85
Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization. New York, NY: Wiley.
Scott, D. W., & Terrell, G. R. (1987). Biased and unbiased cross-validation in density estimation. Journal of the American Statistical Association, 82, 1131–1146. doi:10.1080/01621459.1987.10478550
Sheather, S. J. (1992). The performance of six popular bandwidth selection methods on some real data sets. Computational Statistics, 7, 225–281.
Sheather, S. J. (2004). Density estimation. Statistical Science, 19, 588–597. doi:10.1214/088342304000000297
Sheather, S. J., & Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, 53, 683–690.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. Boca Raton, FL: CRC Press.
Simonoff, J. S. (1996). Smoothing methods in statistics. New York, NY: Springer.
Smith, T. W., Hout, M., & Marsden, P. V. (2013). General social survey, 1972–2012 cumulative file (ICPSR34802-v1) [Data file and codebook]. Chicago, IL: National Opinion Research Center. doi:10.3886/ICPSR34802.v1
Van Boven, L., & Thompson, L. (2003). A look into the mind of the negotiator: Mental models in negotiation. Group Processes & Intergroup Relations, 6, 387–404. doi:10.1177/13684302030064005
Van Dam, N. T., & Earleywine, M. (2011). Validation of the Center for Epidemiologic Studies Depression Scale—Revised (CESD–R): Pragmatic depression assessment in the general population. Psychiatry Research, 186, 128–132. doi:10.1016/j.psychres.2010.08.018
Wand, M. P., & Jones, M. C. (1995). Kernel smoothing. London, United Kingdom: Chapman & Hall.
Wilcox, R. R. (2001). Nonparametric estimation. In D. A. Belsley & E. J.

in the clinical setting: Using the Liebowitz Social Anxiety Scale. Journal of Anxiety Disorders, 16, 661– 673. doi:10.1016/S08876185(02)00134-2 Mugdadi, A.-R., & Jetter, J. (2010). A simulation study for the bandwidth selection in the kernel density estimation based on the exact and the asymptotic MISE. Pakistan Journal of Statistics, 26, 239 –265. Nadaraya, E. A. (1974). On the integral mean square error of some nonparametric estimates for the density function. Theory of Probability and Its Applications, 19, 133–141. doi:10.1137/1119010 Osberg, L., & Smeeding, T. (2006). “Fair” inequality? Attitudes toward pay differentials: The United States in comparative perspective. American Sociological Review, 71, 450 – 473. doi:10.1177/ 000312240607100305 Park, B. U., & Marron, J. S. (1990). Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85, 66 –72. doi:10.1080/01621459.1990.10475307 R Core Team. (2013). R: A language and environment for statistical computing. Available from http://www.R-project.org/ Rodebaugh, T. L., Levinson, C. A., & Lenze, E. J. (2013). A highthroughput clinical assay for testing drug facilitation of exposure therapy. Depression and Anxiety, 30, 631– 637. doi:10.1002/da.22047 Rudemo, M. (1982). Empirical choice of histogram and kernel density estimators. Scandinavian Journal of Statistics, 9, 65–78. Salgado-Ugarte, I. H., & Perez-Hernandez, M. A. (2003). Exploring the use of variable bandwidth kernel density estimators. Stata Journal, 3, 133–147. Schmitt, D. P. (2003). Universal sex differences in the desire for sexual variety: Tests from 52 nations, 6 continents, and 13 islands. Journal of Personality and Social Psychology, 85, 85–104. doi:10.1037/0022-3514 .85.1.85 Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization. New York, NY: Wiley. Scott, D. W., & Terrell, G. R. (1987). Biased and unbiased cross-validation in density estimation. Journal of the American Statistical Association, 82, 1131–1146. doi:10.1080/01621459.1987.10478550 Sheather, S. J. (1992). The performance of six popular bandwidth selection methods on some real data sets. Computational Statistics, 7, 225–281. Sheather, S. J. (2004). Density estimation. Statistical Science, 19, 588 – 597. doi:10.1214/088342304000000297 Sheather, S. J., & Jones, M. C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society, 53, 683– 690. Silverman, B. W. (1986). Density estimation for statistics and data analysis. Boca Raton, FL: CRC Press. Simonoff, J. S. (1996). Smoothing methods in statistics. New York, NY: Springer. Smith, T. W., Hout, M., & Marsden, P. V. (2013). General social survey, 1972–2012 cumulative file (ICPSR34802-v1) [Data file and codebook]. Chicago, IL: National Opinion Research Center. doi:10.3886/ ICPSR34802.v1 Van Boven, L., & Thompson, L. (2003). A look into the mind of the negotiator: Mental models in negotiation. Group Processes & Intergroup Relations, 6, 387– 404. doi:10.1177/13684302030064005 Van Dam, N. T., & Earleywine, M. (2011). Validation of the Center for Epidemiologic Studies Depression Scale—Revised (CESD–R): Pragmatic depression assessment in the general population. Psychiatry Research, 186, 128 –132. doi:10.1016/j.psychres.2010.08.018 Wand, M. P., & Jones, M. C. (1995). Kernel smoothing. London, United Kingdom: Chapman & Hall. Wilcox, R. R. (2001). Nonparametric estimation. In D. A. Belsley & E. J. 
Kontoghiorghes (Eds.), Handbook of computational econometrics (pp. 153–182). Hoboken, NJ: Wiley.

Wilcox, R. R. (2004). Kernel density estimators: An approach to understanding how groups differ. Understanding Statistics, 3, 333–348. doi:10.1207/s15328031us0304_7
Wilcox, R. R. (2006). Graphical methods for assessing effect size: Some alternatives to Cohen's d. Journal of Experimental Education, 74, 351–367. doi:10.3200/JEXE.74.4.351-367
Wilcox, R. R. (2012). Introduction to robust estimation and hypothesis testing (3rd ed.). San Diego, CA: Academic Press.
Wilcox, R. R., Erceg-Hurn, D., Clark, F., & Carlson, M. (2013). Comparing two independent groups via the lower and upper quantiles. Journal of Statistical Computation and Simulation, 84, 1543–1551. doi:10.1080/00949655.2012.754026

Wilcox, R. R., & Schönbrodt, F. D. (2013). The WRS package for robust statistics in R (Version 0.21) [Computer software]. Retrieved from http://r-forge.r-project.org/projects/wrs/
Wolpe, J. (1988). Subjective anxiety scale. In M. Hersen & A. S. Bellack (Eds.), Dictionary of behavioral assessment techniques (pp. 455–457). New York, NY: Pergamon.
Woodroofe, M. (1970). On choosing a delta-sequence. Annals of Mathematical Statistics, 41, 1665–1671. doi:10.1214/aoms/1177696810

Received December 19, 2012
Revision received October 17, 2013
Accepted January 16, 2014

