FUNDAMENTAL AND APPLIED TOXICOLOGY 15, 710-721 (1990)

Exact Statistical Tests for Any Carcinogenic Effect in Animal Bioassays

II. Age-Adjusted Tests

DAVID FARRAR¹ AND KENNY CRUMP

Clement International Corporation, 1201 Gaines Street, Ruston, Louisiana 71270

Received November 29, 1989; accepted July 6, 1990

Exact Statistical Tests for Any Carcinogenic Effect in Animal Bioassays. II. Age-Adjusted Tests. FARRAR, D., AND CRUMP, K. (1990). Fundam. Appl. Toxicol. 15, 710-721. Statistical methods are discussed for application in animal carcinogenesis bioassays that test a general null hypothesis that response frequencies are independent of the treatment level for all tumor endpoints, sexes, and species considered in the bioassay, conditional on survival. These methods are similar to those proposed earlier by Farrar and Crump (1988, Fundam. Appl. Toxicol. 11, 652-663) which involve test statistics that are functions of p values from multiple standard statistical hypothesis tests, and evaluation of significance using randomization. Refinements include the introduction of an exact incidental tumor test analogous to the asymptotic test of Hoel and Walburg (1972, J. Natl. Cancer Inst. 49, 361-372) which takes into account the possibility that any differences among treatment groups in response rates result from differences in survival that is independent of tumors. Additional consideration is given to selection of test statistics, and a test is proposed that is sensitive specifically to cases where the same tumor endpoints are affected in multiple sexes and species. Applications of the procedures to results of National Toxicology Program (NTP) bioassays concerning decabromodiphenyl oxide and iodinated glycerol illustrate the sensitivity of tests based on different test statistics to distinct alternative hypotheses. When compared with the qualitative conclusions of the NTP regarding the strength of evidence for a carcinogenic effect in the same bioassays, the results presented suggest that the methods can be useful in clarifying otherwise equivocal evidence for carcinogenicity. Various difficulties are discussed relevant to the construction of exact combined fatal and incidental tumor tests analogous to the tests of Peto et al. (1980, IARC Monogr. Suppl. 2).
© 1990 Society of Toxicology.

¹ Present address: Clement International Corporation, 9300 Lee Highway, Fairfax, VA 22031.

0272-0590/90 $3.00 Copyright © 1990 by the Society of Toxicology. All rights of reproduction in any form reserved.

The initial step in the evaluation of potential for carcinogenicity to humans of a chemical agent may involve evaluation of the strength of experimental evidence that the agent causes lesions of one or more types in one or more species of nonhuman animals. To evaluate such a hypothesis, it is common to screen a number of cancer endpoints in a bioassay involving two rodent species and both sexes of each species. For example, bioassays conducted routinely by the National Toxicology Program (NTP) usually consider over 20 endpoints in each sex for rats and mice. In this context, the term “general null hypothesis” may be used for the proposition that the tumor response for each endpoint considered is statistically independent of the level of the agent applied within each sex/species combination considered. The issues involved in evaluation of such a hypothesis may be complex, involving an
integration of statistical and biological considerations. Difficulties that arise in the interpretation of results of standard hypothesis tests, when a large number of such tests are involved, have been discussed by various authors (e.g., Brown and Fears, 1981; Farrar and Crump, 1988; Haseman, 1984; Heyse and Rom, 1987; Tarone, 1990; Westfall, 1985). As is widely recognized, false positives will occur at some frequency because of normal variation in tumor counts (i.e., statistical significance by conventional criteria will occasionally occur for agents that are actually not carcinogenic). Conventional criteria for statistical significance are designed to yield a false positive rate of approximately 1 in 20 applications, for each test performed; the probability of one or more false positives in the entire collection of tests may be larger by an uncertain amount. (The latter probability is referred to as the experimentwise error rate.) An obvious approach to solving this problem is to make the criterion for rejection more strict for each test performed. However, it is not trivial to determine how strict the individual tests should be in order to yield a desired experimentwise error rate. Some attempts have been made to measure experimentwise false positive rates for bioassays (Fears et al., 1977; Gart et al., 1979; Haseman, 1983). Such efforts, however, are not likely to lead to firm prescriptions because the experimentwise error probability depends upon factors that may vary from one bioassay to another, including statistical correlations among endpoints and background frequencies of endpoints. An additional complication derives from the fact that the final verdict regarding the general null hypothesis should depend not only on the magnitudes of the smallest p values, but also on the way that the smallest p values are distributed among endpoints, sexes, and species.
For example, it seems that two relatively small p values represent stronger total evidence for carcinogenicity if they represent the same tumor endpoint in
different sexes or species than if they represent biologically unrelated endpoints. However, such considerations have not been incorporated formally in statistical evaluations of bioassay results. The tests proposed by Farrar and Crump (1988) have two primary features: (1) a test statistic is computed that is a function of p values obtained from standard statistical hypothesis tests, for example, the minimum of the p values obtained from Cochran-Armitage trend tests; and (2) the statistical significance of a given value of the test statistic is evaluated using randomization. The p value for a randomization test is the probability of a value of the test statistic as extreme as that actually obtained, if the responses of each unit (e.g., animal) are fixed and independent of assignment to treatment level. (“More extreme” may mean larger or smaller, depending on the test statistic; smaller values are more extreme if relatively smaller values represent relatively more convincing evidence of a true effect of the agent, etc.) In the procedure applied by Farrar and Crump, p values are estimated using a Monte Carlo approach: each animal is represented by a vector of response scores (one for each tumor type), specifying presence, absence, or ignorance for each tumor type. Conventional p values are computed for each response and a value for the test statistic for the general null hypothesis is calculated from these p values. Response vectors are then reassigned at random to treatment groups, conserving the number of animals per treatment group. For each such random reassignment (or permutation), conventional p values are recomputed and the test statistic is recomputed using these p values. The p value used to test the general null hypothesis is estimated by the proportion of reassignments that results in a value of the test statistic as extreme as the value corresponding to the original assignment. 
Test statistics may be functions of p values from different sexes or species; in this case response vectors are reassigned among treatment
groups for a given sex and species, but are not reassigned across sexes and species.

The present contribution extends the work of Farrar and Crump (1988) primarily in two ways. First, the randomization approach used previously took no account of possible differences in survival between treatment groups, such as may result from toxic effects of treatment. Use of age-adjusted tests based on the Mantel-Haenszel approach is standard practice in animal bioassays (Hoel and Walburg, 1972; Peto et al., 1980; Haseman, 1984). Obviously, if animals in a given dose group do not tend to survive to ages when tumors of a given type usually occur, then statistical hypothesis tests based on raw response proportions (numbers of tumor cases divided by number of animals initially available) may be misleading. Randomization methods are discussed here that take into account the possibility of differences in survival between treatment groups. With the introduction of the mortality adjustments, the general null hypothesis given above may be reformulated as follows: for each sex/species combination considered, tumor count is statistically independent of the level of the agent applied among animals of similar age for each endpoint.

Second, additional consideration is given to the choice of test statistics. If the general null hypothesis is rejected at significance levels α or smaller (conventionally α is 0.05) then any choice of test statistic will result in a test with a false positive rate not larger than α. However, different choices of test statistics may lead to tests with different sensitivities for detecting dose-related increases in tumor count. In particular, a case of the approach discussed by Farrar and Crump which has been advocated previously (e.g., Brown and Fears, 1981) is a randomization test of the significance of the smallest p value.
(Westfall (1985) has proposed an approach that is similar, but with evaluation of significance based on a bootstrap distribution.) If the minimum p value is the only test statistic evaluated,
then it is possible that the test will not be sufficiently powerful in cases in which treatment affects tumors of more than one type or the same tumor in more than one species or sex. Examples are given here that illustrate the application of age-adjusted randomization methods, as well as the effects of different choices of test statistics. Methods proposed herein apply only to tumors discovered incidentally at autopsy and not to tumors that cause the death of the animal. Although considerations for developing similar tests for fatal tumors are discussed, no recommendations are made for analysis of such tumors.

CHOICE OF TEST STATISTICS

It is assumed here that a single p value is computed for each distinct disease endpoint in each sex/species combination where the endpoint occurs in the experiment, and that a test statistic is a function of these p values. We adopt a convention of using the same symbol to represent either a test statistic (a function of p values) or a statistical hypothesis test based on the test statistic; for example, we use T1 to represent either the minimum of a set of p values or a test of the general null hypothesis based on the minimum p value. A test statistic may be computed from p values from different sorts of tests (e.g., Fisher’s exact test, Cochran-Armitage trend tests, Mantel-Haenszel tests, etc.). We propose that the p values used correspond to one-sided tests for positive dose-related trends; if two-sided p values are used then the null hypothesis may be rejected for patterns of p values that are difficult to interpret, perhaps involving dose-related increases in tumor count in one sex or species and dose-related decreases in another for the same endpoint.

Three test statistics considered here are defined as follows. A test statistic, denoted T1, is defined as the minimum p value, ignoring sex and species. A second test statistic, denoted T2, is computed in two steps as follows: first, we average p values for each endpoint across sex/species combinations where the endpoint has occurred in the experiment (we have used the geometric average); then T2 is the minimum of the endpoint-specific average p values. The third test, denoted T3, is an example of tests proposed previously by Farrar and Crump (1988) for which test statistics are products of K smallest p values, for various selections of K. In the examples given here, we have used K = 5, i.e., the test statistic is the product of the five smallest p values.

Use of T1 as a test statistic is common in statistical practice (for example, in applications of the Bonferroni method, see Miller, 1981) and in literature on cancer bioassays in particular (e.g., Mantel, 1980; Brown and Fears, 1981; Heyse and Rom, 1987; Westfall and Young, 1989; Tarone, 1990). The rejection set for T1 may include configurations of data with a low p value for one endpoint in a single sex/species combination and larger p values elsewhere, including for the same endpoint in another sex or species. T2 is considered as a procedure for increasing sensitivity to effects that are consistent across sexes and species. Comparing T2 and T1, the rejection set for T2 will consist of configurations of data with low p values representing the same endpoint in multiple sex/species combinations to a larger extent than T1, and, consequently, will include a lower proportion with low p values in isolated sex/species combinations. We do not exclude the possibility of highly sex- and species-dependent effects; however, to the extent that such configurations are uncommon among bioassays involving true carcinogens, use of a test such as T2 may result in increased power.
A test statistic such as T3 (the product of the five smallest p values) can be used to combine the evidence of the smaller p values without assumptions regarding biological relatedness of endpoints considered, and thus may result in greater power in cases where biological relationships among tumor endpoints are unclear.
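
The three statistics defined in this section can be written down compactly. The following is a minimal Python sketch, not the authors' program: the function names `t1`, `t2`, `t3`, the dictionary layout keyed by (endpoint, sex/species), and the example p values are all hypothetical and chosen only for illustration. It assumes every p value is strictly positive, since the geometric average takes logarithms.

```python
import math
from collections import defaultdict

def t1(pvals):
    """T1: minimum p value, ignoring sex and species."""
    return min(pvals.values())

def t2(pvals):
    """T2: minimum over endpoints of the geometric average p value.

    The average for an endpoint is taken over the sex/species
    combinations in which that endpoint occurs.
    """
    by_endpoint = defaultdict(list)
    for (endpoint, _group), p in pvals.items():
        by_endpoint[endpoint].append(p)
    geometric_means = [
        math.exp(sum(math.log(p) for p in ps) / len(ps))
        for ps in by_endpoint.values()
    ]
    return min(geometric_means)

def t3(pvals, k=5):
    """T3: product of the k smallest p values (the text uses K = 5)."""
    return math.prod(sorted(pvals.values())[:k])

# Made-up p values, keyed by (endpoint, sex/species combination).
example = {
    ("liver nodule", "male rat"): 0.01,
    ("liver nodule", "female rat"): 0.04,
    ("thyroid adenoma", "male mouse"): 0.02,
    ("lymphoma", "female mouse"): 0.5,
}
print(t1(example), t2(example), t3(example, k=2))
```

For all three functions a smaller value is more extreme, so a randomization test would count permutations whose statistic is less than or equal to the observed one.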


EXACT TESTS FOR INCIDENTAL TUMORS

In most cases, including at least those tumor types that are not generally the cause of death where they occur, the following randomization procedure may be recommended which is analogous to the asymptotic test of Hoel and Walburg (1972), and may be termed an exact incidental tumor test. In brief, the randomization method allows only exchanges of animals of the same sex and species, and with similar times of death.

For the gth sex/species combination, partition the total days of observation into K_g consecutive intervals denoted I(g, 1), I(g, 2), ..., I(g, K_g). (The earliest day assigned to I(g, i + 1) is one greater than the latest day assigned to I(g, i).) These intervals need not correspond across sexes or species, either in the number of intervals or in the number of days assigned to particular intervals. We denote by S(I(g, i)) the set of animals of the gth sex/species combination whose day of death is in the interval I(g, i). These sets may be referred to as death-time strata. Based on these strata, the Monte Carlo approach for estimating the p values of the general null hypothesis proposed by Farrar and Crump (1988), and described earlier herein, is modified to the extent that for each combination of sex and species, random reassignment of response vectors to treatment groups is carried out within each death-time stratum only. The permutation strategy described defines a multivariate distribution of score statistics with the property that the marginal distribution for each tumor type is identical to the univariate distribution. The basic p values that are used in computation of the test statistic may be based on a Mantel-Haenszel approach as follows. For a specific tumor endpoint in a given sex and species, a z statistic is computed using the formulae

$$z = \frac{\sum_i \sum_j D_j \left( O_{ij} - N_{ij} O_i / N_i \right) + CC}{\left[ \sum_i V_i \right]^{1/2}} \tag{1}$$

and

$$V_i = \frac{O_i \left( N_i - O_i \right)}{N_i^2 \left( N_i - 1 \right)} \left[ N_i \sum_j N_{ij} D_j^2 - \Bigl( \sum_j N_{ij} D_j \Bigr)^2 \right] \tag{2}$$

(see Haseman, 1984). Here i indexes strata and j indexes treatment groups within strata. CC represents a correction for continuity, $D_j$ is the dose of the agent for the jth treatment group, $O_{ij}$ is the count of tumor cases in the jth treatment group and ith stratum, and $N_{ij}$ is the number of animals in the ith stratum and jth treatment group for which the tumor type of interest is known to be present or absent; $N_i = \sum_j N_{ij}$ is the number of animals in the ith stratum and $O_i = \sum_j O_{ij}$ is the number of tumor cases reported for the ith stratum. The p value is computed from the z statistic as the tail area of the standard normal curve corresponding to values larger than z. A Cochran-Armitage test differs from a Mantel-Haenszel test based on a single death-time stratum by a slight difference in the formula for $V_i$ (see Haseman, 1984).

Criteria for a strictly valid test based on the algorithm proposed are identical to criteria for application of the incidental tumor test commonly applied in analysis of tumor bioassay results, for example, by the NTP. Indeed, incidental tumor tests are based on an asymptotic approximation of a distribution generated by precisely the permutation algorithm described here. (For example, tests described here that are defined for the case of a single tumor endpoint in a single sex/species combination reduce to an exact version of the usual incidental tumor test, in that case.) Conditions for the validity of such tests are treated extensively in literature on statistical analysis of bioassay results (e.g., Lagakos and Ryan, 1985; Gart et al., 1986).

EXAMPLES OF DATA ANALYSIS

The NTP (1986) cancer bioassay for decabromodiphenyl oxide involved application via the diet at 0, 25,000, and 50,000 ppm to rats and mice of both sexes (50 animals per sex/species combination, per dose level). The general null hypothesis for this bioassay has been evaluated previously by Farrar and Crump (1988) using randomization tests based on the Cochran-Armitage trend test and Fisher’s exact test, which incorporate no adjustments for possible differences in mortality rates between treatment groups. The only test statistic evaluated for significance that was a function of p values in all sex/species combinations was the minimum p value (similar to the statistic T1 here). The test was conducted with the lesion pancreatic acinar cell (PAC) adenoma removed from the analysis, because the NTP concluded that lesions with this diagnosis included both adenoma and hyperplasia, bringing into question the meaningfulness of statistical results based on the diagnosis. With the data set modified in this way, the minimum p value was still highly significant (p < 0.0001), apparently due to dose-related trends in cases of neoplastic nodules in the livers (NNL) of male and female rats. When the data for NNL were removed, a test of the minimum of p values from other lesions did not approach significance. The p values from conventional incidental tumor tests, equal to 0.1 or smaller, are enumerated in Table 1.

We have tested the general null hypothesis for the decabromodiphenyl oxide bioassay using exact incidental tumor tests as described above for three subsets of the data: (1) all tumor types included in the analysis including PAC and NNL; (2) PAC excluded; and (3) both PAC and NNL excluded. (These modifications affect the numbers of tumor types but not numbers of animals.) For each of these sets, a non-mortality-adjusted test was performed by assigning all animals to a single stratum, and an age-adjusted test was performed based on five strata, corresponding to Weeks 0-52, 53-78, 79-92, 93-102, and 103-104.
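
The stratified reassignment used for the age-adjusted tests can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the field names (`group`, `week`, `dose`, `responses`), the function names, and the convention that a smaller statistic is more extreme are hypothetical choices. The stratum boundaries are the five week ranges quoted above, and a single-stratum (unadjusted) test would correspond to passing one boundary covering the whole study.

```python
import random
from bisect import bisect_left
from collections import defaultdict

# Last week of each death-time stratum: Weeks 0-52, 53-78, 79-92, 93-102, 103-104.
STRATUM_LAST_WEEKS = [52, 78, 92, 102, 104]

def stratum_of(week_of_death, last_weeks=STRATUM_LAST_WEEKS):
    """Return the 0-based index of the death-time stratum containing the week."""
    if week_of_death < 0 or week_of_death > last_weeks[-1]:
        raise ValueError("week of death outside the observation period")
    return bisect_left(last_weeks, week_of_death)

def permute_within_strata(animals, rng):
    """Shuffle dose labels among animals sharing a sex/species group and stratum.

    Shuffling dose labels within a cell is equivalent to reassigning response
    vectors to treatment groups while conserving the number of animals per
    treatment group in that cell. Nothing moves across sexes, species, or strata.
    """
    cells = defaultdict(list)
    for idx, animal in enumerate(animals):
        cells[(animal["group"], stratum_of(animal["week"]))].append(idx)
    permuted = [dict(animal) for animal in animals]
    for indices in cells.values():
        doses = [animals[i]["dose"] for i in indices]
        rng.shuffle(doses)
        for i, dose in zip(indices, doses):
            permuted[i]["dose"] = dose
    return permuted

def monte_carlo_p(animals, statistic, n_iter=1000, seed=0):
    """Estimate the randomization p value; smaller statistic = more extreme."""
    rng = random.Random(seed)
    observed = statistic(animals)
    hits = sum(
        statistic(permute_within_strata(animals, rng)) <= observed
        for _ in range(n_iter)
    )
    return hits / n_iter
```

Here `statistic` stands for any function of the data set that recomputes the conventional p values and combines them, for example into the minimum p value.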

TABLE 1
p VALUES SMALLER THAN 0.10 FROM INCIDENTAL TUMOR TESTS, FOR DECABROMODIPHENYL OXIDE

                                                                No. of strata for incidental tumor test
                                                                1                5
Male mice
  Liver hepatocellular adenoma                                  p = 0.020        p = 0.065
  Multiple organs, NOS^a malignant lymphoma, histiocytic type   p = 0.048        p = 0.048
  Thyroid follicular cell adenoma                               p = 0.064        p = 0.087
  Multiple organs, NOS granulocytic leukemia                    p = 0.044        p = 0.044
  Adrenal adenoma, NOS                                          p = 0.042        p = 0.059
Female mice
  Multiple organs, NOS malignant lymphoma, mixed type           p = 0.10         p = 0.12
  Multiple organs, NOS malignant lymphoma, histiocytic type     p = 0.079        p = 0.083
Male rats
  Pancreas acinar cell carcinoma                                p = 0.0065       p = 0.0078
  Liver neoplastic nodule                                       p = 0.000043     p = 0.00001
  Salivary gland fibrosarcoma                                   p = 0.11         p = 0.097
  Preputial/clitoral gland adenoma, NOS                         p = 0.077        p = 0.057
Female rats
  Liver neoplastic nodule                                       p = 0.024        p = 0.0011
  Mammary gland adenoma, NOS                                    p = 0.077        p = 0.082

^a Not otherwise specified, by examining pathologist.

Three tests of the general null hypothesis were performed for each data subset and stratification scheme based on the test statistics described previously, namely the minimum p value (T1), the minimum endpoint average p value (T2), and the product of the five smallest p values (T3). (The specification of precisely five p values is largely arbitrary; T3 is considered here as an example of a class of test statistics.) The p values were computed using the formulae given above, but modified by replacing any values smaller than 10^-8 by 10^-8, before computing test statistics. This was done because occasionally the numerical computation of a p value would return a value of zero when the true value was very small but not zero. (Some of our test statistics involve multiplying p values so that the value of the statistic is zero if one of the contributing p values is zero, regardless of the magnitudes of the other p values.) In the application of formulae given for p values, no correction for continuity was applied.
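
The computation of a single conventional p value, as described above (Mantel-Haenszel z statistic, upper normal tail, floor at 10^-8, no continuity correction), might look like the sketch below. It is an illustration only: the function name `mh_trend_p` and the input layout are hypothetical, the variance term assumes the usual hypergeometric form for the stratum variance, and returning 1.0 when no stratum shows any variation is an assumption made here, not something the text specifies.

```python
import math

def mh_trend_p(strata, doses, cc=0.0, floor=1e-8):
    """One-sided Mantel-Haenszel trend p value across death-time strata.

    strata: list of (tumor_counts, animal_counts) pairs, one per stratum;
            each is a list with one entry per treatment group.
    doses:  dose D_j for each treatment group.
    cc:     continuity correction (0.0, as in the examples in the text).
    """
    numerator = 0.0
    variance = 0.0
    for tumors, animals in strata:
        n_i = sum(animals)   # animals examined in stratum i
        o_i = sum(tumors)    # tumor cases in stratum i
        if n_i < 2:
            continue         # V_i is undefined; such a stratum adds nothing
        numerator += sum(
            d * (o - n * o_i / n_i) for d, o, n in zip(doses, tumors, animals)
        )
        s1 = sum(n * d for d, n in zip(doses, animals))
        s2 = sum(n * d * d for d, n in zip(doses, animals))
        variance += o_i * (n_i - o_i) / (n_i ** 2 * (n_i - 1)) * (n_i * s2 - s1 ** 2)
    if variance <= 0.0:
        return 1.0           # assumption: no variation means no evidence
    z = (numerator + cc) / math.sqrt(variance)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal
    return max(p, floor)     # avoid exact zeros before multiplying p values
```

A stratum in which every animal, or no animal, has the tumor has zero variance and zero contribution to the numerator, so it drops out of the statistic automatically.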

The use of a Monte Carlo procedure results in some variation of p value estimates from one computer run to the next; therefore, 95% confidence bounds have been computed for the true p values. (Formulae are given in footnotes of Tables 2 and 3.) In the following discussion, certain conventions are adhered to for reporting results of randomization tests. When a single value is reported, it is implied that both upper and lower bounds can be rounded to that value. For example, if the 95% confidence interval is 0.21 to 0.22, it is economical to report simply p = 0.2. When the upper and lower bound are not identical in at least the first significant digit, we report the confidence interval as a range, for example, p = 0.007 to 0.02. When the estimated p value is zero, implying that no permutations were discovered for which the test statistic was as extreme as the value realized, we report p < U, where U is the upper 95% confidence bound.

TABLE 2
RESULTS OF RANDOMIZATION INCIDENTAL TUMOR TESTS FOR DECABROMODIPHENYL OXIDE, USING THREE SUBSETS OF THE DATA

Test    Strata    Test statistic value      p value      95% confidence
                  (natural logarithm)       estimate     interval^a

(i) All endpoints included
T1      1         -10                       0            (0-0.003)
T1      5         -12                       0            (0-0.003)
T2      1         -8.0                      0            (0-0.003)
T2      5         -9.2                      0            (0-0.003)
T3      1         -28                       0            (0-0.003)
T3      5         -29                       0            (0-0.003)

(ii) Pancreatic acinar carcinoma (PAC) removed^b
T1      1         -10                       0            (0-0.003)
T1      5         -12                       0            (0-0.003)
T2      1         -8.0                      0            (0-0.003)
T2      5         -9.2                      0            (0-0.003)
T3      1         -26                       0.003        (0-0.003)
T3      5         -27                       0            (0-0.003)

(iii) PAC and liver neoplastic nodules removed^c
T1      1         -3.9                      0.69         (0.66-0.72)
T1      5         -3.1                      0.95         (0.94-0.97)
T2      1         -3.2                      0.62         (0.59-0.65)
T2      5         -2.8                      0.79         (0.77-0.82)
T3      1         -16                       0.61         (0.58-0.64)
T3      5         -15                       0.88         (0.86-0.90)

^a For p value estimates larger than zero, upper and lower bounds are given by p ± 1.96 [p(1 - p)/N]^(1/2), where N is the number of iterations (here 1000). For p value estimates equal to zero, lower bounds are given by zero and upper bounds are given by 1 - (0.05)^(1/N).
^b Reported for male rats.
^c Reported for male and female rats.

From Table 2, regardless of the stratification scheme, there is a clear-cut difference between results for the first two sets of results (all tumor types, or all but PAC), where the general null hypothesis can be rejected with high confidence, and results for the final set (PAC and NNL removed), where p values are uniformly larger than 0.5. These results suggest strong statistical evidence of carcinogenicity in rats (where NNL and PAC were reported) and no evidence of carcinogenicity in mice. This conclusion appears to be supported by a supplemental analysis (Table 3): each sex/species combination was analyzed in isolation, so that tests T1 and T2 were equivalent. All tumor endpoints were included in the analysis. Considering only the age-adjusted tests (strata = 5), the results for male rats are highly significant (p < 0.003 for both tests). For female rats considered alone the results are somewhat less significant (p = 0.007 to 0.02 for T1; p = 0.2 for T3). For mice there appears to be no evidence of an effect. (Results with a single stratum are given only to illustrate the effect of an age correction.) The NTP (1986) concluded that there was some evidence of carcinogenicity in rats, equivocal evidence of carcinogenicity in male
mice, and no evidence of carcinogenicity in female mice. However, results in Table 3, combined with the results from the third analysis in Table 2, strongly suggest that NNL in male and female rats was the only lesion affected by treatment. This example illustrates the potential usefulness of the methods proposed in clarifying equivocal evidence. Results in Table 3 also illustrate the utility of age adjustment. Whereas the test T3, when unadjusted (i.e., based on a single stratum), found a significant effect in male mice, the corresponding age-adjusted test did not. This result justifies our concern, raised in Farrar and Crump (1988), that the marginal significance found for male mice in the non-age-adjusted analyses might be due to low survival among control mice.

Unfortunately, the results presented in Table 2 are not useful for comparing different tests because where an effect of treatment apparently exists (the first or second set of results), all tests are so highly significant that p value estimates not equal to zero are difficult to obtain. It appears that a particular endpoint is affected in multiple sex/species combinations (NNL in male and female rats); however, the special usefulness of T2 in such cases, which we have postulated, cannot be evaluated. The results in Table 3 suggest that T1 may provide more power than T3 in cases where only a single endpoint is affected (NNL in female rats).

TABLE 3
RESULTS OF RANDOMIZATION INCIDENTAL TUMOR TESTS FOR DECABROMODIPHENYL OXIDE, CONDUCTED INDEPENDENTLY FOR EACH SEX/SPECIES COMBINATION

Test    Strata    Test statistic value      p value      95% confidence
                  (natural logarithm)       estimate     interval^a

Female mice
T1      1         -1.1                      0.68         (0.66-0.71)
T1      5         -1.1                      0.68         (0.65-0.71)
T3      1         -5.1                      0.43         (0.40-0.46)
T3      5         -4.8                      0.55         (0.52-0.58)

Male mice
T1      1         -1.7                      0.24         (0.21-0.26)
T1      5         -1.4                      0.51         (0.48-0.54)
T3      1         -7.0                      0.039        (0.027-0.051)
T3      5         -6.2                      0.11         (0.09-0.13)

Female rats
T1      1         -2.6                      0.019        (0.011-0.027)
T1      5         -3.0                      0.015        (0.007-0.023)
T3      1         -5.8                      0.24         (0.21-0.26)
T3      5         -6.1                      0.19         (0.16-0.21)

Male rats
T1      1         -4.4                      0            (0-0.003)
T1      5         -5.0                      0            (0-0.003)
T3      1         -9.5                      0            (0-0.003)
T3      5         -10                       0.001        (0-0.003)

^a For p value estimates larger than zero, upper and lower bounds are given by p ± 1.96 [p(1 - p)/N]^(1/2), where N is the number of iterations. For p value estimates equal to zero, lower bounds are given by zero and upper bounds are given by 1 - (0.05)^(1/N). Results are based on 1000 iterations, with the exception that results for male mice with a single stratum are based on 5000 iterations.
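
The confidence bounds described in the footnotes of Tables 2 and 3 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name `mc_confidence_bounds` is hypothetical, and clamping the bounds to the interval from 0 to 1 is an addition made here for convenience rather than something stated in the footnotes.

```python
import math

def mc_confidence_bounds(p_hat, n_iter):
    """95% bounds for a Monte Carlo p value estimate from n_iter permutations."""
    if p_hat == 0:
        # No permutation was as extreme as the observed data:
        # lower bound 0, upper bound 1 - 0.05**(1/N).
        return 0.0, 1.0 - 0.05 ** (1.0 / n_iter)
    half_width = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / n_iter)
    # Clamping to [0, 1] is an assumption of this sketch.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# An estimate of 0.69 from 1000 iterations gives roughly (0.66, 0.72).
low, high = mc_confidence_bounds(0.69, 1000)
print(round(low, 2), round(high, 2))
```

With 1000 iterations and an estimate of zero, the upper bound is about 0.003, which is the value reported as p < 0.003 in the text.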

As a second example of application of the tests proposed, the NTP (1988) bioassay for carcinogenicity of iodinated glycerol involved application in water by gavage to rats and mice of both sexes. The agent was administered at levels 0, 62, and 125 mg/kg to females of each species and at 0, 125, and 150 mg/kg to males (50 animals per sex/species combination at each dose level). Survival of high-dose male rats was significantly lower than survival of controls after Week 86, but no other significant differences in survival between dosed groups and controls were found in either species. It was concluded by the NTP that there is some evidence of carcinogenic activity in male rats and female mice based on results for mononuclear cell leukemia and follicular cell carcinomas of the thyroid gland in male rats, and on adenomas of the anterior pituitary and neoplasms of the Harderian gland in female mice, but that there is no evidence of carcinogenic activity in female rats or in male mice. Thus the qualitative conclusions of the NTP suggest effects that are highly sex and species dependent.

The general null hypothesis for these data has been tested using an exact incidental tumor test. Five strata were defined for this test in all sex/species combinations: 0-52 weeks, 53-78 weeks, 79-92 weeks, 93 weeks to the end of the week before the terminal kill period, and the terminal kill period itself. The terminal kill period was Weeks 112 to 114 for mice and Weeks 111 to 112 for rats. (These are the strata reported by the NTP for their incidental tumor tests.)

The same three randomization tests were performed as in the previous example (decabromodiphenyl oxide), combining all sexes and species in a single analysis. Based on 1000 iterations, T3 was more highly significant than the other tests (p = 0.006); T1 was also significant (p = 0.027) but T2 was not significant (p = 0.098). The conclusions of the NTP for this example suggest that evidence of carcinogenicity may involve effects that are not consistent across sexes and species. The results presented here suggest that, as anticipated, T3 may be valuable in cases of highly sex- and species-dependent effects, for determining whether the combined evidence of the smallest p values is sufficient to reject the general null hypothesis on statistical grounds.

EXACT TESTS FOR FATAL TUMORS AND COMBINATIONS OF INCIDENTAL AND FATAL TUMORS

For tumor types that are fatal, life-table methods are more appropriate than tests of the sort described above (Peto et al., 1980). In such methods the response evaluated is death from tumor rather than presence of tumor. The fatal tumor test as applied by the NTP relies on a version of Formula (1); however, various quantities that enter this formula are redefined, so that each stratum corresponds to a day of the experiment on which one or more tumor fatalities occurred and consists of all animals at risk on that day (e.g., animals alive at the start of the day). Thus, whereas in the incidental tumor test each animal is represented in precisely one stratum, corresponding to the set of days that includes its date of death, in the fatal tumor test, an animal is represented in each stratum corresponding to a day of observation earlier than its recorded date of death. Life-table tests may result in more powerful tests when types of tumors are involved that are rapidly lethal. As an extreme example, in the case of a highly lethal tumor type, it may happen that an early death-time stratum defined for an incidental tumor test includes only animals that died as a result of tumors. Then the information from these tumors is lost because a stratum with no variation in the response variable will make no contribution to the result of the z test. By contrast, if a fatal tumor test is applied, then variation will be present for each day on which some animals die from the tumor of interest and others do not.
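
The difference between death-time strata and the risk sets of a fatal tumor test can be made concrete with a short sketch. The function name `risk_set_strata` and the input layout are hypothetical, and the sketch adopts the "alive at the start of the day" reading of "at risk", so an animal dying on a given day still belongs to that day's stratum.

```python
def risk_set_strata(animals):
    """Build one stratum per day on which at least one tumor fatality occurred.

    animals: list of (day_of_death, died_of_tumor) pairs.
    Returns a dict mapping each such day to the indices of all animals
    at risk on that day (alive at the start of the day).
    """
    fatal_days = sorted({day for day, fatal in animals if fatal})
    return {
        day: [i for i, (death_day, _) in enumerate(animals) if death_day >= day]
        for day in fatal_days
    }

# Animal 2 appears in both strata, which is why reassigning animals
# independently within each stratum is not straightforward.
example = [(100, True), (150, False), (200, True), (200, False)]
print(risk_set_strata(example))
```

Because one animal contributes to several risk sets, a permutation that conserves group sizes in one stratum will generally disturb them in another.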

EXACT

TESTS FOR ANY CARCINOGENIC

EFFECT, II

719

For valid application of a fatal tumor test it is important to have reliable information on the context of observation of particular tumors, especially when treatment groups vary in mortality related to variables other than the tumor of interest. (Tumors are said to occur in a fatal context when they are judged to be the cause of death of the animals in which they are diagnosed; tumors discovered in animals dying from causes other than tumors are said to occur in an incidental context.) If, for example, incidental tumors are mistakenly categorized as fatal, then a spurious dose-related effect may appear, simply because of a higher rate of discovery of incidental tumors at higher doses.

Context of observation assignment is widely considered to be difficult in practice. The NTP routinely conducts fatal tumor tests for each tumor type occurring in a bioassay without context of observation data, assuming that all cases of the tumor are fatal unless they are discovered as a consequence of a sacrifice. Apparently, the results of these tests are allowed to influence the final assessment of weight of evidence only for tumor types considered to be highly lethal.

If context of observation is available for all tumors, and fatal and incidental tumor tests are performed using, respectively, the fatal and incidental subsets of tumors, then it is desirable in some manner to combine the results of the two tests to allow an assessment of the total weight of evidence for carcinogenicity based on all observed tumors. In answer to this need, Peto et al. (1980) recommend combining the results of the incidental tumor and fatal tumor computations, to compute a single p value for each tumor type, by treating the strata defined separately for incidental and fatal tumor tests as independent strata, and combining across all strata in the same manner as results are combined across strata in the separate incidental and fatal tumor tests.

Efforts to derive an exact randomization test analogous to the fatal tumor test, or to the combined incidental-and-fatal test of Peto et al., will encounter several problems. First, fatal tumor tests condition on the number of animals alive on days when tumor fatalities occur. It is not generally possible to reassign animals to treatment groups in such a way that these numbers of animals are conserved simultaneously in all strata, because data from an individual animal may contribute to multiple strata. An idea to which we have given some consideration involves treating the contributions of a given animal to multiple strata as independent data. Second, if randomization is to be performed stratum-by-stratum, as elaborated above for incidental tumor tests, then it is not clear how randomization can be performed simultaneously for incidental and fatal tumor analyses, given that the strata defined for the two sorts of tests do not necessarily correspond. Third, it is not clear what analyses can be done if context of observation assessments are not available for each tumor, since the fatal tumor test requires that each animal death be assigned either to a particular type of tumor or to some competing cause.

In view of these problems, it is clear that permutation tests that involve fatal tumors are not as straightforward as those that involve incidental tumors only. No particular recommendations for tests involving fatal tumors stand out as being clearly warranted.

DISCUSSION

The present contribution extends the work of Farrar and Crump (1988) in primarily two ways. First, additional consideration is given to the basis for selection of test statistics. Specifically, the test T2, based on averaging p values for each endpoint followed by selection of the minimum average p value, is introduced to provide sensitivity in cases where effects are consistent across sexes or species. Second, randomization tests of the general null hypothesis are introduced, which are analogous to the incidental tumor test of Hoel and Walburg (1972) and which account for possible differences in survival among treatment groups.

The methods described herein can be used to define a large number of distinct tests, but to apply many of them in a single analysis may create a new multiple comparison problem. One possible strategy would be to apply T1 and one other test, such as either T2 or T3, in which effects at different endpoints or different animal groups are able to reinforce one another. However, it is quite possible that the values of different test statistics are so highly correlated that the false positive rate is not substantially increased by applying several tests. The problem of multiple comparisons might also be mitigated by a hierarchical approach in which an overall test for carcinogenicity is first performed and, if this test is positive, one then proceeds to test individual animal groups and eventually individual endpoints in individual groups.

The final decision regarding whether results of a bioassay support the conclusion that an agent can cause cancer in animals requires integration of a complex array of considerations. The methods proposed formally integrate some of these considerations. The combined evidence of multiple tumor endpoints is evaluated using tests that take into account the consistency of effects across sexes and species. The rate of false positives is controlled by limiting the number of tests performed. The issue of whether effects are dose related is taken into account by basing test statistics on p values from trend tests. The possibility of spurious results due to differences in mortality rates among treatment groups is accounted for by use of new age-adjusted randomization tests. In addition, Farrar and Crump (1988) argue that nonlinearity of dose-response relationships can be accounted for by incorporating p values from pairwise comparisons of treated groups to the control group (e.g., from Fisher's exact test) into the computation of a test statistic along with p values from trend tests.

However, other sorts of information are not incorporated, including the evidence of historical background rates of particular tumor types, and the biological plausibility of an effect of a given agent on a given tumor endpoint, given current knowledge. In particular, the biological considerations are certain to be complex and to vary from one bioassay to the next. In view of the complexity of the issues involved, there is perhaps some value in reiterating the principle that a final decision cannot be based on a rigid statistical decision rule. However, it is precisely because of the complexity of the issues involved that methods of the sort proposed here may be valuable; such methods may allow some considerations to be eliminated to a large degree (for example, by reducing the likelihood of a false positive as a statistical artifact to a small, quantifiable level), thus reducing the complexity of the decision process.

ACKNOWLEDGMENTS

Drs. Joseph Haseman and Keith Soper made a number of helpful comments on an earlier version of this manuscript. This work was made possible by a grant from Merck, Sharp, and Dohme.

REFERENCES

BROWN, C., AND FEARS, T. (1981). Exact significance levels for multiple binomial testing with application to carcinogenicity screens. Biometrics 37, 763-774.
FARRAR, D., AND CRUMP, K. (1988). Exact statistical tests for any carcinogenic effect in animal bioassays. Fundam. Appl. Toxicol. 11, 652-663.
FEARS, T., TARONE, R., AND CHU, K. (1977). False-positive and false-negative rates for carcinogenicity screens. Cancer Res. 37, 1941-1945.
GART, J., CHU, K., AND TARONE, R. (1979). Statistical issues in interpretation of chronic bioassay tests for carcinogenicity. J. Natl. Cancer Inst. 62, 957-974.
GART, J., KREWSKI, D., LEE, P., AND WAHRENDORF, J. (1986). The Design and Analysis of Long-Term Animal Experiments. International Agency for Research on Cancer (W.H.O.) Scientific Publications No. 79.
HASEMAN, J. (1983). A re-examination of false positive rates for carcinogenesis studies. Fundam. Appl. Toxicol. 3, 334-339.
HASEMAN, J. (1984). Statistical issues in the design, analysis and interpretation of animal carcinogenicity studies. Environ. Health Perspect. 58, 385-392.
HEYSE, J., AND ROM, D. (1987). Adjusting for Multiplicity of Statistical Tests in the Analysis of Carcinogenicity Studies Using Multiresponse Randomization Tests. Presented at the Annual Meeting of the American Statistical Association.
HOEL, D., AND WALBURG, H. (1972). Statistical analysis of survival experiments. J. Natl. Cancer Inst. 49, 361-372.
LAGAKOS, S., AND RYAN, L. (1985). On the representativeness assumption in prevalence tests of carcinogenicity. Appl. Stat. 34, 54-62.
MANTEL, N. (1963). Chi-square tests with one degree of freedom: Extensions of the Mantel-Haenszel procedure. J. Amer. Stat. Assoc. 58, 690-700.
MANTEL, N. (1980). Assessing laboratory evidence for neoplastic activity. Biometrics 36, 381-399.
MILLER, R. (1981). Simultaneous Statistical Inference. Springer-Verlag, New York.
National Toxicology Program (NTP) (1986). Toxicology and Carcinogenesis Studies of Decabromodiphenyl Oxide. Technical Report 309, U.S. Department of Health and Human Services.
National Toxicology Program (NTP) (1988). Toxicology and Carcinogenesis Studies of Iodinated Glycerol. Technical Report 340, U.S. Department of Health and Human Services.
PETO, R., PIKE, M., DAY, N., GRAY, R., LEE, P., PARISH, N., PETO, J., RICHARDS, S., AND WAHRENDORF, J. (1980). Guidelines for simple sensitive significance tests for carcinogenic effects in long-term animal experiments. Long-Term Screening Assays for Carcinogens: A Critical Appraisal. IARC Monogr. Suppl. 2.
TARONE, R. (1990). A modified Bonferroni method for discrete data. Biometrics 46, 515-522.
WESTFALL, P. (1985). Simultaneous small-sample multivariate Bernoulli confidence intervals. Biometrics 41, 1001-1013.
WESTFALL, P., AND YOUNG, S. (1989). P-value adjustments for multiple tests in multivariate binomial models. J. Amer. Stat. Assoc. 84, 780-786.
