Postgraduate Medicine

ISSN: 0032-5481 (Print) 1941-9260 (Online) Journal homepage: http://www.tandfonline.com/loi/ipgm20

The importance of significance and the significance of importance

Richard Riegelman

To cite this article: Richard Riegelman (1979) The importance of significance and the significance of importance, Postgraduate Medicine, 66:1, 119-124, DOI: 10.1080/00325481.1979.11715203

To link to this article: http://dx.doi.org/10.1080/00325481.1979.11715203

Published online: 07 Jul 2016.


INTERPRETING MEDICAL STUDIES A Series

The importance of significance and the significance of importance

It is statistically significant; therefore, it must be true. It is statistically significant; therefore, it must be important. It is statistically significant; therefore, it must be clinically useful. You cannot argue with statistical significance. Or can you? This article is designed to demonstrate that the critical reader can and should argue with statistical significance.

Richard Riegelman, MD, MPH

Third in a series of five articles on interpreting medical studies

Dr Riegelman is research director and assistant professor of medicine, department of health care sciences, George Washington University Medical Center, Washington, DC.

Illustration: Niculae Asciu

VOL 66/NO 1/JULY 1979/POSTGRADUATE MEDICINE

Statistical significance testing is a method for deciding whether a difference between groups is likely to be due to chance. Unfortunately, no statistical method is available for directly testing the truth of a hypothesis or theory about why groups differ. Thus, investigators must resort to a method of proof by elimination. This method is known as statistical significance testing.¹

The method works on the principle that a hypothesis must be either true or false. In applying tests of statistical significance, we assume that the hypothesis is false and see how well the data are explained. If there is only a small probability that the false, or null, hypothesis could explain the data, then we reject the null hypothesis and by elimination accept the study hypothesis.

Procedure for statistical significance testing

The specific steps in significance testing are as follows.

1. The investigator states the hypothesis to be tested before collecting the data.

2. The investigator assumes the study hypothesis to be false. This new formulation is known as the null hypothesis.

3. The investigator determines what level of probability or significance will be considered so small that the null hypothesis will be rejected. In this article, as in the majority of clinical studies, a chance of 5% or less is considered to be unlikely enough to allow the investigator to reject the null hypothesis. The 5% figure is traditional; however, it is arbitrary. It leaves open the possibility that chance alone has produced an unusual set of data. Thus, a null hypothesis which is in fact true may be rejected in favor of the study hypothesis.

4. After collecting the data, the investigator calculates the probability that the data would have occurred if the null hypothesis were true.

5. The investigator determines the statistical significance of the study data. If the data obtained are likely to occur only 5% or less of the time if the null hypothesis is true, the investigator may reject the null hypothesis and by elimination accept the study hypothesis. If, however, the probability that the data could occur by chance if the null hypothesis were true is greater than 5%, the investigator may not reject the null hypothesis. This does not mean that the null hypothesis is true. It merely says that the probability that the null hypothesis explains the data is too great to reject it in favor of the study hypothesis. The burden of proof is on the investigator to show that the null hypothesis is quite unlikely before rejecting it in favor of the study hypothesis.

The following example illustrates how this significance testing procedure operates.

An investigator wants to test the study hypothesis that mouth cancer is associated with pipe smoking. He formulates a null hypothesis that pipe smoking is not associated with mouth cancer. He then decides to reject the null hypothesis if the study data can be explained by the null hypothesis only 5% or less of the time. He next collects the data, and using probability methods, finds that if there were no association between pipe smoking and mouth cancer, the data would have occurred only 3% of the time. Since the data would be quite unlikely to occur if pipe smoking and mouth cancer were not associated, he decides to reject the null hypothesis. He thus accepts by elimination the study hypothesis that mouth cancer is associated with pipe smoking.

Errors in determining significance

Investigators may be misled if they do not understand and follow this significance testing procedure, as the next example shows.

An investigator randomly selects 100 individuals known to have essential hypertension and 100 individuals known to be free of hypertension and compares the two groups according to a list of 100 variables to determine how they differ. Of the 100 variables studied, two are found to be statistically significant at the 0.05 level by standard statistical methods. The investigator finds that the hypertensives generally have more than five letters in their last names, while the nonhypertensives usually have five letters or less. Second, the hypertensives generally were born during the first three and one-half days of the week, while the nonhypertensives were usually born during the last three and one-half days. The investigator concludes that, despite the fact that he had not foreseen these differences, longer names and birth during the first half of the week are statistically associated with essential hypertension.

This example illustrates the importance of stating the hypothesis beforehand. Whenever a large number of variables are tested by the usual methods, some of them are likely to be statistically significant by chance alone. Thus, it is not surprising that this investigator found some variables to be statistically significant. If the hypothesis is not stated beforehand, there can be no null hypothesis to reject. In addition, it is not proper to apply the usual tests of statistical significance unless the hypothesis is stated before the data are collected and analyzed.

Even if the statistical procedure is properly carried out, false conclusions can be drawn from statistically significant results, as the following example illustrates.

In a review article the author evaluates 20 well-conducted studies which examined the relationship between breast-feeding and breast cancer. Nineteen of the studies found no association between breast-feeding and breast cancer, and one found a statistically significant association at the 0.05 level. The author concludes that breast cancer may be associated with breast-feeding and cites the one study which found an association.

If 20 well-conducted studies are performed to test an association which does not exist, one of them may show an association by chance alone. The meaning of statistical significance at the 0.05 level should be remembered. It implies that the results have a 5% chance (1/20) of occurring by chance alone if no association exists. Thus, one study in 20 showing an association should not be interpreted as evidence for an association.

The previous example demonstrated that the presence of statistical significance does not guarantee the presence of a true difference between groups. On the other hand, the absence of statistical significance does not exclude the presence of a true difference. The process of significance testing allows one to reject or not reject a null hypothesis. It does not allow one to prove a null hypothesis. Two factors can operate to prevent a study from demonstrating a difference when one exists.²

1. Chance alone may have produced a result which does not fulfill the probability criteria for statistical significance despite the presence of a true difference.

2. If inadequate numbers of subjects are used in a study, it is said to have inadequate "power" to demonstrate an association between two variables although one exists.

Using large numbers to try to demonstrate a difference may be regarded as parallel to using a high-power rather than a low-power microscope to see bacteria. Large numbers increase the resolution of the study. The following example illustrates how inadequate numbers may prevent an investigator from demonstrating a difference.

In a study of the adverse effects of cigarettes on health, 100 cigarette smokers and 100 nonsmokers are followed for 20 years. During that time, lung cancer develops in five smokers and no nonsmokers and myocardial infarction occurs in ten smokers and nine nonsmokers. The results for lung cancer are statistically significant but those for myocardial infarction are not. The investigators conclude that an association between cigarette smoking and lung cancer has been supported and an association between cigarette smoking and myocardial infarction refuted.

When the difference between groups is small, large numbers are required to demonstrate it. Very likely, the numbers used in the preceding example were too small to give the study enough power to demonstrate an association between cigarette smoking and myocardial infarction, even though other studies suggest that one exists. On the other hand, this study cannot be said to help refute an association between cigarette smoking and myocardial infarction. A study with limited statistical power to demonstrate a difference also has limited power to refute a difference.
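The two kinds of error described in this section can be sketched numerically. The short Python example below is an illustration only, not a method taken from the article: the function names are hypothetical, the tests are assumed to be independent, the power figure uses a normal approximation for comparing two proportions, and the observed myocardial infarction rates (10% and 9%) are treated as if they were the true rates.

```python
from statistics import NormalDist


def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests is
    'significant' at level alpha when no true association exists."""
    return 1 - (1 - alpha) ** n_tests


def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two proportions.

    Normal approximation; the negligible chance of a significant result
    in the wrong direction is ignored. p1 and p2 are ASSUMED true rates.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error of the difference under the null and the alternative
    se_null = (2 * p_bar * (1 - p_bar) / n_per_group) ** 0.5
    se_alt = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group) ** 0.5
    return z.cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)


if __name__ == "__main__":
    # 100 variables tested, or 20 studies of an association that does not exist
    for n in (20, 100):
        print(n, "tests: expected chance findings =", round(n * 0.05, 2),
              "| P(at least one) =", round(chance_of_false_positive(n), 3))

    # Myocardial infarction example: 10% vs 9%, 100 subjects per group,
    # compared with a hypothetical study of 20,000 per group
    for n in (100, 20000):
        print(n, "per group: approximate power =",
              round(approx_power(0.10, 0.09, n), 3))
```

Under these assumptions, 20 independent tests of a nonexistent association give roughly a 64% chance of at least one significant result, and a study of 100 subjects per group has only a few percent chance of detecting a one-percentage-point difference in event rates.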

Interpreting the difference

Having analyzed the data and determined that the study was properly performed and that a statistically significant difference exists, the investigator or reviewer must then interpret the difference. He or she must ask, "Of what importance is this difference and is it a clinically useful distinction?"³ In answering, two possible limitations must be kept in mind.

First, if enough individuals are used in a study, even a very small difference is likely to be statistically significant. In other words, a very large study has the power to demonstrate very small, even inconsequential differences. The following example illustrates how a statistically significant difference may be too small to be clinically important.
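Before turning to that example, the effect of sample size on a fixed, small difference can be sketched in a few lines of Python. This is a hedged illustration rather than the article's own calculation: the function name is hypothetical, the difference of 0.1 units and the standard deviation of 1.5 units are assumed values chosen for demonstration, both groups are assumed to be the same size, and a large-sample z approximation is used.

```python
from statistics import NormalDist


def two_sided_p(diff, sd, n_per_group):
    """Two-sided p value for a difference between two group averages.

    Large-sample z approximation; assumes both groups share the same
    standard deviation (sd) and the same size (n_per_group).
    """
    standard_error = sd * (2 / n_per_group) ** 0.5
    z_score = abs(diff) / standard_error
    return 2 * (1 - NormalDist().cdf(z_score))


if __name__ == "__main__":
    # Same assumed difference (0.1) and SD (1.5); only the group size changes
    for n in (50, 500, 5000, 50000):
        print(n, "per group: p =", round(two_sided_p(0.1, 1.5, n), 4))
```

With these assumed values the difference is far from significant at 50 subjects per group and clearly significant at 5,000 per group, although the size of the difference never changes.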


Investigators follow 100,000 middle-aged men for ten years to determine the factors associated with coronary artery disease. They hypothesize beforehand that a high uric acid level may be a factor in predicting coronary artery disease. They find that men in whom coronary artery disease developed had an average uric acid level of 7.8 mg/dl, whereas those in whom coronary artery disease did not develop had an average level of 7.7 mg/dl. The difference is statistically significant at the 0.05 level.

The difference is statistically significant but so small that it is not clinically important. The large number of subjects used allowed the investigators to detect a very small difference. However, the small size of the difference makes it unlikely that uric acid level played a major role in the development of coronary artery disease. The small difference does not usefully allow the clinician to identify those persons in whom coronary artery disease will develop.

The second limitation of the concept of statistical significance is that the conclusions about differences between groups are based exclusively on mean or average values. Thus, that a difference is statistically significant says that the difference between the average values in the two groups is unlikely to result by chance alone. The presence of a statistically significant difference tells us little about how varied or diverse the values for each group are. It is possible that the values for individuals in each group are very close to the average or that they vary widely on either side of the average.

In an attempt to avoid the problems that arise from this second limitation, research results are frequently presented with a measure of the variability or dispersion of the groups. This measure is known as a standard deviation (SD). Assuming the population chosen is distributed in approximately a bell-shaped, or normal, distribution, the SD has a convenient property: About 95% of the individual values are within 2 SD on either side of the average.

Research results are usually presented as an average ± 1 SD. This makes it possible to easily estimate the extent of overlap between the two populations. For example, assume that population A has scores of 80 ± 10. The 80 is the average and the 10 is 1 SD. Assume that population B has scores of 60 ± 15. Remember that 95% of the individuals in each population will have scores within 2 SD of their average. Thus,

Population A
Scores: 80 ± 10
2 SD = 2 × 10 = 20
Therefore, 95% of the scores in population A will be between 60 and 100.

Population B
Scores: 60 ± 15
2 SD = 2 × 15 = 30
Therefore, 95% of the scores in population B will lie between 30 and 90.
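The 2 SD arithmetic above can be reproduced with a small Python sketch. The function names are hypothetical and introduced only for illustration; the calculation relies on the same assumption the text makes, namely that each population is approximately normally distributed.

```python
def two_sd_range(average, sd):
    """Return the (low, high) range lying within 2 SD of the average."""
    return (average - 2 * sd, average + 2 * sd)


def shared_range(range_a, range_b):
    """Return the part of the two ranges that overlaps, or None if they
    do not overlap at all."""
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    return (low, high) if low < high else None


if __name__ == "__main__":
    population_a = two_sd_range(80, 10)   # scores of 80 ± 10
    population_b = two_sd_range(60, 15)   # scores of 60 ± 15
    print("Population A:", population_a)
    print("Population B:", population_b)
    print("Scores found in both ranges:", shared_range(population_a, population_b))
```

Any pair of results reported as an average ± 1 SD can be passed to the same two functions to see how much the groups overlap.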


Thus, values between 60 and 90 could well come from either population. This reflects a large overlap between the groups.

The degree of overlap is quite important in assessment of the clinical usefulness of the results of a study, as the following example illustrates.

A study of WBC counts in 10,000 viral and 10,000 bacterial infections finds the following results. Viral infections are associated with WBC counts of 8,000 ± 500 per cubic millimeter, while bacterial infections are associated with WBC counts of 9,000 ± 500. The results are statistically significant at the 0.05 level. The investigators conclude that the WBC count is extremely useful in differential diagnosis of infection, since it can distinguish viral from bacterial infection in most cases.

When the reviewer calculates the boundaries of 2 SD, this conclusion is brought into serious question. If both populations are distributed in a bell-shaped curve, then 95% of the viral infections will have a WBC count between 7,000 and 9,000 per cubic millimeter, while 95% of the bacterial infections will have a WBC count between 8,000 and 10,000. Thus, nearly half of both groups will have values between 8,000 and 9,000. In this range the values are of no clinical usefulness in distinguishing between viral and bacterial infections. Thus, despite the statistical significance of the finding, the large degree of overlap of the values limits its clinical usefulness.

Summary

This article has attempted to illustrate the following principles of statistical significance.

1. An investigator must state the hypothesis to be tested beforehand, so that a null hypothesis can be constructed and validly tested using standard statistical techniques.

2. A statistically significant difference does not prove that a difference exists. It merely establishes that if no difference exists, there is only a small chance that the results obtained would have occurred. A small chance is traditionally and arbitrarily defined as 5% or less. However, even 5% may be too large when important decisions depend on the results.

3. Failure to establish a statistically significant difference does not prove that no difference exists. Chance alone or the small numbers used may have prevented the investigators from rejecting the null hypothesis.

4. A statistically significant difference may be too small to be clinically useful. Statistical significance can be demonstrated for even the most inconsequential of differences if the number of individuals used in the study is large enough.

5. The clinical usefulness of a statistically significant difference may be limited by the degree of overlap. This is reflected in the standard deviation around the averages.

In answer to the questions posed at the beginning of this article: If it is statistically significant, it might be true. If it is statistically significant, it might be important. If it is statistically significant, it might be clinically useful.

Next: Contributory cause: Unnecessary and insufficient.

Address reprint requests to Richard Riegelman, MD, Department of Health Care Sciences, George Washington University Medical Center, 1229 25th St NW, Washington, DC 20037.

References

1. Snedecor GW, Cochran WG: Statistical Methods. Ed 6. Ames, Ia, Iowa State University Press, 1976, p 27
2. Feinstein AR: Clinical Biostatistics. St Louis, CV Mosby Co, 1977, p 293
3. Feinstein,² pp 323-325
