AGGRESSIVE BEHAVIOR Volume 41, pages 1–13 (2015)

An Item Response Theory Analysis of the Olweus Bullying Scale

Kyrre Breivik1* and Dan Olweus2

1 Regional Centre for Child and Adolescent Mental Health and Welfare, Uni Research Health, Norway
2 Regional Centre for Child and Adolescent Mental Health and Welfare, Uni Research Health, Norway and University of Bergen, Norway
In the present article, we used IRT (graded response) modeling as a useful technology for a detailed and refined study of the psychometric properties of the various items of the Olweus Bullying scale and the scale itself. The sample consisted of a very large number of Norwegian 4th–10th grade students (n = 48,926). The IRT analyses revealed that the scale was essentially unidimensional and had excellent reliability in the upper ranges of the latent bullying tendency trait, as intended and desired. Gender DIF effects were identified with regard to girls' use of indirect bullying by social exclusion and boys' use of physical bullying by hitting and kicking, but these effects were small and worked in opposite directions, having negligible effects at the scale level. Also, scale scores adjusted for DIF effects differed very little from non-adjusted scores. In conclusion, the empirical data were well characterized by the chosen IRT model, and the Olweus Bullying scale was considered well suited for the conduct of fair and reliable comparisons involving different gender-age groups. Aggr. Behav. 41:1–13, 2015. © 2014 Wiley Periodicals, Inc.

Keywords: bullying; Olweus Bullying scale; item response theory; differential item functioning; psychometric properties

INTRODUCTION

Numerous methods of assessment, such as self-report questionnaires or interviews, peer nominations or ratings, parent or teacher reports, and behavioral observations, have been employed for the measurement of bully/victim problems in children and youth (see Jimerson, Espelage, & Swearer, 2010, for overviews). Although there may be strengths and weaknesses with any method, a well-constructed self-report questionnaire has many attractive psychometric and practical features, including easy administration to large populations of respondents. One such widely used instrument is the Olweus Bullying Questionnaire, the OBQ (Olweus, 1996; Solberg and Olweus, 2003), and the current paper reports a detailed psychometric analysis of the Bullying (sub)scale of that questionnaire.

Solberg and Olweus (2003) reported various psychometric aspects of the two global variables measuring being bullied and bullying other students (in the past couple of months) in the Revised Version of the OBQ. Several empirical and conceptual analyses strongly attested to the functionality of the two selected variables in terms of construct validity and certain measurement properties. Internal consistency reliabilities (Cronbach's

alpha) for the Being Bullied scale or the Bullying scale (consisting of the 8–9 items concerning different forms of bullying) have been investigated in a number of samples from different countries and are in the .80-.90 range (Fekkes, Pijpers, & Verloove-Vanhorick, 2005; Felix and McMahon, 2006; Ferguson, San Miguel, & Hartley, 2009; Hartung, Little, Allen, & Page, 2011; Jenson, Dieterich, Brisson, Bender, & Powell, 2010; Kyriakides, Kaloyirou, & Lindsay, 2006; Stavrinides, Georgiou, & Theofanous, 2010; Strohmeier, Kärnä, & Salmivalli, 2011; Theriot, Dulmus, Sowers, & Johnson, 2005). As peer nominations and peer ratings have become relatively common techniques for the identification of bullies and victims, such measures have also been used to examine the concurrent/convergent validity of the OBQ. In a meta-analysis (Card, 2003) of the



Correspondence to: Kyrre Breivik, Uni Research, Postbox 7810, NO-5020 Bergen, Norway. E-mail: [email protected]
Received 13 October 2011; Revised 6 October 2014; Accepted 7 October 2014
DOI: 10.1002/ab.21571
Published online 2 December 2014 in Wiley Online Library (wileyonlinelibrary.com).


relation between self-reports and peer nominations/ratings of victimization (several of which concerned the OBQ), the average correlation coefficient for the 21 studies was r = .37. A methodologically "cleaner" result where all self-report variables came from the OBQ has been provided by Salmivalli (personal communication, March 16, 2009). In this sample comprising 17,600 students from grades 4 through 9 with modal ages of 10–16 years, the correlation between the average score of the different forms of victimization in the OBQ and the average of three similar peer rating variables (not nominations) amounted to r = .40. For the global OBQ variable the correlation was r = .39. Corresponding correlations between the average score for the different bullying forms and the global bullying others variable, respectively, and the average of three similar rating variables were r = .34 and r = .36. These results are relatively high given that the average correlation between self and peer ratings of internalizing and externalizing problems has been found to be r = .26 (Achenbach, McConaughy, & Howell, 1987). They are also an indication of the construct (convergent) validity of the key dimensions of the OBQ.

Kyriakides et al. (2006) used Item Response Theory (IRT) modeling to provide a more detailed analysis of the two main scales of the OBQ (Being Bullied and Bullying Other Students). The scales were analyzed for reliability, fit of the model, meaning, and validity for a sample of 335 Greek Cypriot 11- and 12-year-old students. It was generally concluded that the OBQ is a psychometrically sound instrument in terms of reliability and validity and that the results provide "further evidence of the international usefulness of the instrument" (Kyriakides et al., 2006; p. 797). Although most previous psychometric analyses have shown the OBQ to be psychometrically sound, we here report a more refined analysis of the instrument using IRT technology.
In contrast to the Rasch model used by Kyriakides et al., in which all items are assumed to have the same discrimination power, we used Samejima's graded response model (Samejima, 1969), where the discrimination parameters for the various items are allowed to vary freely. In addition, it is possible that some of the results in the Greek Cypriot study could have been sample-dependent, and we therefore based our analyses on a much larger sample of students (n = 48,926) from Norway, the country in which the questionnaire was originally developed.

AN OVERVIEW OF IRT MODELING, RESEARCH AIMS

IRT is a set of advanced statistical models which aim to describe, "..in probabilistic terms, the relationship between a person's response to a survey question and his

or her level of the "latent variable" being measured by the scale" (Reeve and Fayers, 2005). IRT has many preferable qualities when compared to statistical analyses based on classical test theory (CTT). While CTT test statistics are highly sample and test dependent, the parameters derived from IRT are assumed to be sample and item invariant (within a linear transformation). The IRT invariance property facilitates the use of many advanced and useful psychometric tools such as computer adaptive testing, differential item functioning analysis, and the linking of different items or assessment instruments on a common scale. In addition, and in contrast to CTT, IRT does not assume that a scale is equally reliable across all score levels, but rather provides information about the precision of a test at different score levels. Research has furthermore found that the use of a simple sum score derived from CTT can lead to scaling artifacts (e.g., spurious interaction effects) that are either avoided or strongly minimized if IRT trait scores are used instead (see Embretson and Reise, 2000; Bovaird, 2010).

Assumptions of IRT

The first main aim of the present paper was to examine whether it is appropriate to use traditional IRT modeling on the Bullying scale. IRT models typically rest on four major assumptions: monotonicity, local independence, unidimensionality, and a normally distributed latent trait or construct (Edwards, 2009). Monotonicity means that the probability of endorsement of item response categories increases with higher levels of the latent trait. Local independence implies that there should be (practically) no correlation between the items after the latent trait in the model is controlled for. For the traditional and most popular IRT models, this assumption usually translates into an assumption of unidimensionality.
Even if it is common to distinguish between different bullying forms, for example, physical (e.g., hitting and pushing), verbal (e.g., name calling), and indirect/relational forms (e.g., social exclusion and spreading false rumors), we predicted that these forms are facets or manifestations of one dominant factor or bullying tendency trait that is essentially unidimensional (Hays, Morales, & Reise, 2000). Many traditional IRT models assume that the latent trait is normally distributed (Woods, 2006; Woods and Thissen, 2006; Edwards, 2009). Given the typical skewness of bullying variables (Solberg and Olweus, 2003; Solberg, Olweus, & Endresen, 2007), we examined whether parameter estimation might be biased if we used a traditional IRT model. In the present study, we therefore compared the IRT parameters of a traditional IRT model with those from an IRT model using an estimator which detects and takes possible non-normality into account.


Psychometric Properties

A second main aim of the present research was to assess the psychometric properties of the bullying tendency scale using IRT technology. Estimation of item characteristic functions or curves (ICCs) is of central importance to all item response theory models. An ICC is a curve which shows to what extent the probability of endorsing an item is dependent on the respondent's latent trait level (Morizot, Ainsworth, & Reise, 2007). Respondents with a higher trait value typically have a greater probability of endorsing any particular item than respondents with a smaller value. Standard IRT models usually estimate two kinds of parameters, discrimination and difficulty (or severity) parameters, to describe the ICC for the different items. The discrimination parameter (a, alpha) is the slope of the ICC, and its steepness reflects how well the item is related to the latent trait (theta). This parameter is similar to a factor loading, and items with higher discrimination values are better able to discriminate between respondents above and below the "inflection point," the point on the latent trait where the probability of endorsing the item is 50 percent. The previously mentioned study by Kyriakides et al. (2006) assessed the OBQ using the rating scale model (Andrich, 1978), which is a polytomous (more than two response categories) Rasch model. Rasch models have many desirable features, but often fail to adequately fit the original scale as they assume that the discrimination parameter is the same for all items (Thissen and Orlando, 2001). Accordingly, we assumed that a less restricted IRT model, where items are also allowed to vary in their discrimination parameter, would fit the data better. The difficulty/severity parameters (betas, b1 and b2), also called location or threshold parameters, indicate the location along the trait continuum where "an individual would have a 50% chance of endorsing a particular item" (Edwards, 2009) or response category.
By examining these parameters it is possible to order the specific bullying forms in terms of severity. Based upon prevalence rates in previous research (e.g., Baldry and Farrington, 1999; Collins, McAleavy, & Adamson, 2004; Fekkes et al., 2005; Koo, Kwak, & Smith, 2008; Smith & Shu, 2000; Solberg et al., 2007; Wang, Iannotti, & Nansel, 2009), we predicted that bullying by calling other students mean names and teasing them in a hurtful way (q25) would be the least severe form of bullying. We furthermore predicted that bullying someone by stealing or damaging his or her belongings (q29) would be the most extreme form of bullying. We refrained from making specific hypotheses with regard to the ranking of the severity parameters of the other bullying items due to less consistent findings across studies.
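The relation between the discrimination and threshold parameters and the response probabilities described above can be made concrete with a small sketch. The following Python code (illustrative only; the parameter values are taken from the q25 row of Table I, and the function names are our own) computes the boundary and category probabilities under the graded response model:

```python
import math

def boundary_prob(theta, a, b):
    """P(responding in category k or higher) for a boundary with
    discrimination a and threshold b, under the graded response model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def category_probs(theta, a, betas):
    """Probabilities of each ordered response category
    (len(betas) + 1 categories in total)."""
    stars = [1.0] + [boundary_prob(theta, a, b) for b in betas] + [0.0]
    return [stars[k] - stars[k + 1] for k in range(len(stars) - 1)]

# q25 ("called mean names"): a = 2.34, b1 = 0.95, b2 = 2.15 (Table I)
probs = category_probs(1.5, 2.34, [0.95, 2.15])

# At a trait level equal to a threshold, the probability of responding
# in that category or higher is exactly .50, as the text describes.
assert abs(boundary_prob(0.95, 2.34, 0.95) - 0.5) < 1e-9
```

Note how a steeper slope (larger a) makes the boundary probability change more sharply around the threshold, which is what "better discrimination" means in this context.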


In line with the monotonicity assumption, we posited that the frequency with which a particular bullying behavior is performed would be informative in determining the item's severity position or threshold value on the latent trait. Accordingly, the response alternative "once or twice" or more often was clearly expected to have a lower threshold value than "2 or 3 times a month" or more often. We also wanted to explore the reliability of the present scale within an IRT framework. As mentioned, previous analyses derived from classical test theory have indicated that the present scale has adequate to excellent reliability. A major advantage of IRT, however, is that it is able to assess how reliable the instrument is across different levels of the underlying latent trait. The result from such analyses can be very useful as it indicates where along the trait continuum the scale provides adequate information for its purposes (Hays et al., 2000). In line with the intentions in constructing the questionnaire (Olweus, 1996), we expected that the present scale would be most informative (reliable) and provide the most precise differentiation among individuals at rather severe levels of the latent trait. This is as intended, since bullying is a relatively low-prevalence phenomenon and the scale is therefore not expected to differentiate as well at lower levels of the trait. Such a result would be similar to the findings for other scales that measure low-prevalence symptoms/behaviors or attitudes (cf. Reise and Waller, 2009).

Differential Item Functioning

An important task of IRT analysis is to examine whether the various items function in the same way, that is, have the same measurement properties, in different subgroups within the studied population, such as boys and girls or younger and older respondents.
It is possible that a construct or latent trait will manifest itself somewhat differently in various subgroups and/or that items have a different relation with this construct in different subgroups even if the model fits adequately (Reise, Ainsworth, & Haviland, 2005). If so, this would imply that respondents from different subgroups with the same latent trait level do not have the same probability of endorsing the item concerned and the item response would thus be likely to mean slightly different things in different subgroups. This phenomenon is termed differential item functioning (DIF). If a scale contains items with DIF (and they are not adjusted for), comparisons between subgroups may be "biased" or misleading (Teresi, 2006). In the present study, we explored whether there were any DIF effects tied to gender and age (primary versus lower secondary grades, named "younger" and "older" in what follows). We


strongly suspected that some DIF effects would be found for gender. Research has repeatedly found boys to be considerably more likely than girls to use direct, especially physical, aggression (Archer, 2004; Card, Stucky, Sawalani, & Little, 2008; Olweus, 2010). Given the large gender difference in direct aggression, indirect aggression appears to be a more preferred mode of aggression among girls than boys (Olweus, 2010). We thus predicted that girls with a certain latent trait level would be somewhat more likely to endorse items assessing indirect forms of bullying than boys with the same trait level, while the opposite would be the case for bullying by physical means. We had no hypotheses with regard to possible age DIF effects.

Summing up, in the present article we used IRT (graded response) modeling as a presumably useful technology for a detailed and refined study of the psychometric properties of the various items of the Bullying scale from the OBQ, for the identification of possible DIF effects, their impact and possible adjustment, and for examination of the psychometric properties of the total Bullying scale itself.

METHOD

Participants

The cross-sectional data which form the basis of this study were collected in the context of a large-scale implementation of the Olweus Bullying Prevention Program, the OBPP, in Norwegian schools. The sample consisted of 350 elementary and lower secondary schools which took the Olweus Bullying Questionnaire three to four months before they were to implement the OBPP in their schools (Olweus and Limber, 2010). The schools constituted five cohorts which took the measurement in the period from October 2001 to October 2003. The sample consisted of 48,926 students in grades 4 through 10, 24,958 boys and 23,968 girls, with modal ages between 10 and 16 years. Of the students, 33,796 were from the elementary level (grades 4 through 7) and 15,130 came from the lower secondary level (grades 8 through 10). The levels of bullying (and victimization) problems were similar to what has been found to characterize Norway at the national level (e.g., Craig et al., 2009; Olweus, 2010).

Procedure

In schools that are to implement the OBPP, a special project coordinator is appointed who, together with the specially trained Olweus instructor/trainer, has responsibility for the practical arrangements and administration of the OBQ. The students in the present study typically took the questionnaire in their own classrooms and gave their anonymous responses on a paper

version of the questionnaire. When administered in the anonymous mode, students only provide information about their own classroom, grade, gender, and school. Detailed instructions on how to respond and an explanation or "definition" of what is meant by bullying were included in the questionnaire (see, e.g., Olweus, 1996, 2010 for more information). The students were strongly encouraged to give sincere answers and were told that their responses would be treated as confidential. The questionnaire data were (and typically are) used to generate a detailed report about various aspects of "the bullying situation" at each individual school, and this information is used as an important background for the schools' anti-bullying work with the OBPP.

The Revised Olweus Bullying Questionnaire

The Revised Olweus Bullying Questionnaire (the OBQ; Olweus, 1996, 2010) contains a "definition" or explanation of bullying which is designed to capture all three main elements of the definition of bullying: an assumed intention to inflict injury or discomfort upon the target (an awareness that the implicated behaviors are likely to be perceived as negative by the target), the somewhat repetitive nature of bullying, and the imbalance in power between the target and the perpetrator(s) (Olweus, 1993, 2010, 2013). After the definition and a general or global question about having been bullied in the past couple of months (taking all possible forms of bullying into account), the students are asked to respond to questions about eight specific forms of bullying they may have been exposed to in the same period. Since we do not think that it is appropriate to conceptualize being a victim of bullying as a trait characterizing the targeted student, an implicit assumption of IRT (e.g., Reise and Waller, 2009), we decided to use IRT analysis only for the parallel items about bullying other students (see, e.g., Ybarra, Boyd, Korchmaros, & Oppenheim, 2012, for a view that is similar to our position).
These items can easily be perceived as representing a (latent) bullying tendency trait. The exact formulations of the items are given in Table I. These various forms of bullying comprise direct physical (q27, q29, q30) and verbal (q25, q31, q32) bullying as well as more indirect or relational ways of bullying in the form of intentional social exclusion and manipulation of friendship relationships (q26) and the spreading of false rumors (q28). Some questions about cyber or digital bullying were added to the questionnaire in 2005 and afterwards, but since the data for the present sample were collected prior to 2005, such items were not included in the version of the scale used in this study.

TABLE I. Descriptive Statistics and Item Response Theory Parameter Estimates

Question 25: I called another student(s) mean names and made fun of or teased him or her in a hurtful way
Question 26: I kept him or her out of things on purpose, excluded him or her from my group of friends, or completely ignored him or her
Question 27: I hit, kicked, pushed, and shoved him or her around, or locked him or her indoors
Question 28: I spread false rumors about him or her to make others dislike him or her
Question 29: I took money or other things from him or her or damaged his or her belongings
Question 30: I threatened or forced him or her to do things he or she did not want to do
Question 31: I bullied him or her with mean names or comments about his or her race or color
Question 32: I bullied him or her with mean names, comments, or gestures with a sexual meaning

Item   Mean   Standard deviation   Factor loading   Alpha   Beta1   Beta2
q25    1.26   0.52                 0.77             2.34    0.95    2.15
q26    1.14   0.41                 0.71             1.83    1.58    2.80
q27    1.11   0.36                 0.77             2.27    1.68    2.65
q28    1.08   0.31                 0.78             2.27    1.85    2.92
q29    1.03   0.20                 0.82             2.56    2.35    3.14
q30    1.04   0.23                 0.84             2.77    2.12    2.96
q31    1.08   0.32                 0.77             2.30    1.83    2.78
q32    1.09   0.34                 0.80             2.56    1.71    2.59
IRT Parameter Estimation

IRT parameter estimation was performed with the Multilog 7.03 (Thissen, 2003) program using Samejima's (1969) graded response model (GRM) for each of the eight items included in the present study. The GRM is a popular model for estimating ordered polytomous data. In this particular model, each item has one discrimination parameter and as many difficulty parameters (thresholds) as there are response categories minus one (Edwards, 2009). All items about the different forms of bullying other students have the following response alternatives, coded 1–5: "it hasn't happened in the past couple of months" (1), "it has only happened once or twice" (2), "2 or 3 times a month" (3), "about once a week" (4), and "several times a week" (5). However, with relatively few students endorsing categories 4 and 5 (range 0.04% to 1.7%), we decided to collapse categories 3–5 into one category coded (3). Accordingly, our IRT analyses comprised three response alternatives and the following two thresholds: category 1 versus categories 2–5, that is, (have) "not bullied" other students versus bullied other students "once or twice" or more often; and categories 1–2 versus 3–5, that is, bullied other students "once or twice" or less versus bullied other students "2 or 3 times a month" or more.

Assumptions of the IRT Model

The unidimensionality assumption of the IRT model was tested by use of exploratory and confirmatory factor analysis with the Mplus 6.0 (Muthén and Muthén, 2010) program. The robust weighted least squares estimator (WLSMV) was used because of the skewness of the raw data. The monotonicity assumption was assessed by plotting rest score graphs (see Hall, Reise, & Haviland, 2007) with the SPSS version 17.0 program. The normality assumption of the latent trait distribution was assessed by use of Ramsay-curve item response theory modeling (RC-IRT) using the RCLOG version 2.0 computer program (Woods, 2006; Woods and Thissen, 2006).

Differential Item Functioning

Prior to examining for potential DIF effects, it is important to find an appropriate set of (non-DIF) anchor items which serves to match individuals from the two groups being compared on the underlying trait. Anchor items are items for which there are no clear indications of DIF effects and which can thus be assumed to function similarly for the two groups compared (e.g., boys and girls or younger and older


students). As recommended by several researchers (e.g., Navas-Ara and Gómez-Benito, 2002; Teresi, 2006), an iterative purification process was used to identify anchor items, in which each item was examined for potential DIF by using all the other items as a temporary anchor set. For each item, the fit of a model where the studied item's parameters (alpha and beta values) were constrained to be equal for the relevant subgroups was compared with a model where the parameters were left free. We judged an item as potentially problematic if there was a statistically significant chi-square difference between the constrained and unconstrained models. As the chi-square test is highly sensitive to sample size (Borsboom, 2006), we decided that there also had to be a minimum 0.25 standard deviation difference between the groups in at least one of the two severity parameters (the beta values; Uebelacker, Strong, Weinstock, & Miller, 2009; Weinstock, Strong, Uebelacker, & Miller, 2009). When the size of the discrimination parameter (alpha) was significantly different between the subgroups, we visually examined the operating characteristic curves (OCCs) to determine whether the nonuniform DIF effect was non-negligible (cf. Steinberg and Thissen, 2006). After all the items had been tested as possible anchor items, potentially problematic items were removed, and the remaining items were reanalyzed. This purification process was repeated until all of the remaining items displayed minimal DIF and were thus judged as suitable as an anchor set by the specified criteria. Potential DIF effects tied to age and gender were explored. We first tested whether there were any DIF effects for age (younger, grades 4–7 versus older, grades 8–10). The younger group served as the reference group.
As we wanted to take age group DIF effects into account while testing for gender DIF (with boys as the reference group), we created a new data set which included appropriate sub-items that took age DIF into account (see Flora, Curran, Hussong, & Edwards, 2008). The scale was evaluated for gender and age group DIF by a model-based likelihood ratio test approach using the freeware program IRTLRDIF (Thissen, 2001). The impact of potential DIF effects at the scale level was assessed by use of total test response functions based on both the non-corrected and the DIF-corrected expected scores for the four different groups: younger boys, younger girls, older boys, and older girls (see Fig. 4). Total test functions are simply the sum of all the item response functions for the items included in the scale. We also examined the impact of the potential DIF effects on the group means by calculating IRT trait scores with and without DIF correction.
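The core of the model-based likelihood ratio approach is a comparison of the fit of a constrained model (studied item's parameters equal across groups) with a free model: twice the difference in log-likelihood is referred to a chi-square distribution with degrees of freedom equal to the number of freed parameters. A minimal sketch of that comparison step in Python (the log-likelihood values below are hypothetical placeholders, not output from IRTLRDIF):

```python
def lr_dif_test(loglik_constrained, loglik_free, critical_value):
    """Likelihood-ratio (G2) statistic comparing a DIF-constrained model
    with a model where the studied item's parameters are freed across
    groups; the item is flagged if G2 exceeds the critical value."""
    g2 = 2.0 * (loglik_free - loglik_constrained)
    return g2, g2 > critical_value

# Hypothetical log-likelihoods for one studied item with all other items
# anchored. Freeing one discrimination (a) and two thresholds (b1, b2)
# across groups gives df = 3; the chi-square .05 critical value is 7.815.
g2, flagged = lr_dif_test(-25310.4, -25301.2, critical_value=7.815)
```

A flagged item would then be checked against the additional effect-size criterion used in this study (a minimum 0.25 SD between-group difference in at least one severity parameter) before being judged as showing non-negligible DIF.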

RESULTS

Model Assumptions

All analyses supported essential unidimensionality. The exploratory factor analysis indicated that the data were well characterized by a single dominant factor: the ratio of the first to the second principal component was 9.09 (eigenvalue 5.238 versus 0.576). The one-factor solution fitted well in a categorical confirmatory factor analysis (CFI = 0.99, RMSEA = 0.031), and all the items had strong standardized factor loadings, ranging between 0.71 and 0.84. Further, no consequential violations of monotonicity were found, as the rest score curve analyses revealed that, for all items, the probability of endorsing higher categories increased systematically along the latent trait continuum.

Psychometric Properties

As shown in Table I, all the specific bullying items had large to very large discrimination (alpha) parameters, ranging from 1.83 through 2.77. Q26 ("I kept him or her out of things on purpose...") had the smallest value, while q30 ("I threatened or forced...") had the greatest, which means that the latter item was better able to discriminate between respondents below and above the "inflection" point. The severity (beta) parameters were also quite large, ranging from 0.95 through 2.35 for the first threshold and 2.15 through 3.14 for the second. As predicted, the verbal bullying item q25 ("I called another student(s) mean names...") and the physical bullying item q29 ("I took money or other things...") had the lowest and highest parameter values, respectively. The test information curve for the total Bullying scale is shown in Figure 1. The solid line represents the information curve and is very peaked, as hypothesized. This result shows that the bullying tendency scale provides precise and reliable information about students who are 1 to 3 standard deviations above the mean, that is, in the upper range of the latent score distribution.
Since information scores of 5 and 10 equal Cronbach's alpha values of approximately 0.80 and 0.90, respectively (reliability = 1 - (1/information); Reeve and Fayers, 2005), it is obvious that students in the upper range of the latent score distribution are measured very reliably. In contrast, and as expected, the scale provides less certain information (lower information values) about respondents below the mean, many of whom have responded "it hasn't happened" on many items. This uncertainty is also revealed by the dashed line in Figure 1, representing the standard error of measurement (SEM), which is inversely related to the information curve (SEM = 1/sqrt(information)).
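The two conversions used in this paragraph, reliability = 1 - 1/information and SEM = 1/sqrt(information), can be checked directly. A minimal sketch (function names are illustrative):

```python
import math

def reliability_from_information(info):
    """Reliability at a given trait level: 1 - 1/I(theta)
    (Reeve and Fayers, 2005)."""
    return 1.0 - 1.0 / info

def sem_from_information(info):
    """Standard error of measurement at a trait level: 1 / sqrt(I(theta))."""
    return 1.0 / math.sqrt(info)

# Information values of 5 and 10 correspond to reliabilities of .80 and .90,
# matching the approximate Cronbach's alpha equivalences cited in the text.
assert abs(reliability_from_information(5) - 0.80) < 1e-9
assert abs(reliability_from_information(10) - 0.90) < 1e-9
```

Because SEM shrinks as information grows, the peaked information curve in Figure 1 translates directly into the smallest measurement error in the upper trait range.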


Fig. 1. Precision of the instrument along the trait level (Cronbach's alpha = 0.83). Information curve (solid) and standard error of measurement curve (dashed).

Differential Item Functioning

As explained in the Method section, we created a new data set which took the age DIF effects into account when we tested for potential gender DIF effects. The new data set consisted of eleven items instead of the original eight because each of the items signalled as having age DIF was split into two (younger and older versions, e.g., q26 younger and q26 older). Q26, q27, and q28 were flagged as showing nonignorable uniform gender DIF effects when combined with the purified anchor set (consisting of items q25, q29, q30, q31 younger, and q32). Based upon the results from all the DIF analyses, a final data set was assembled consisting of 16 items. The new item parameters were calibrated with Multilog 7.03. The main threshold results (beta 1 and beta 2) can be summarized as follows. As shown in Table II, there were gender DIF effects as regards the two thresholds for both q26 (social exclusion) and q27 (physical bullying), but in opposite directions: Given the same trait value, girls were generally more likely than boys to endorse q26, the item about social exclusion, whereas boys were generally more inclined to endorse q27, one of the items about physical bullying. Note that the previously identified gender DIF effect on q28, the item about rumor spreading, was very small.
