
European Journal of Internal Medicine journal homepage: www.elsevier.com/locate/ejim

Original Article

The faulty statistics of complementary alternative medicine (CAM)

Maurizio Pandolfi a,⁎,1, Giulia Carreras b

a University of Lund, Sweden
b Cancer Prevention and Research Institute (ISPO), via delle Oblate 2, 50139 Florence, Italy

Article info

Article history:
Received 14 May 2014
Received in revised form 29 May 2014
Accepted 29 May 2014
Available online 18 June 2014

Keywords:
Prior probability
Complementary alternative medicine
Bayes theorem

Abstract

The authors illustrate the difficulties involved in obtaining a valid statistical significance in clinical studies, especially when the prior probability of the hypothesis under scrutiny is low. Since the prior probability of a research hypothesis is directly related to its scientific plausibility, the commonly used frequentist statistics, which does not take this probability into account, is particularly unsuitable for studies exploring matters disconnected, in varying degrees, from science, such as complementary alternative medicine (CAM) interventions. Any statistical significance obtained in this field should be considered with great caution and may be better attributed to more plausible hypotheses (such as a placebo effect) than to the one examined, which usually is the specific efficacy of the intervention. Since achieving meaningful statistical significance is an essential step in the validation of medical interventions, CAM practices, producing only outcomes inherently resistant to statistical validation, appear not to belong to modern evidence-based medicine.

© 2014 European Federation of Internal Medicine. Published by Elsevier B.V. All rights reserved.

⁎ Corresponding author at: via de' Martelli 7, 50129 Florence, Italy. Tel.: +39 055283943. E-mail address: mauri.pandolfi@gmail.com (M. Pandolfi).
1 Former professor of ophthalmology.

The last two to three decades have witnessed the increasing diffusion and popularity of complementary alternative medicine (CAM) [1], and some interventions of this kind are currently employed in public hospitals and even taught in university medical faculties. Enlightened supporters of CAM, while underlining the peculiar nature of these interventions, are striving to obtain clinical validation according to the rules of the very medicine from which they distance themselves in several respects, namely modern, evidence-based medicine. Randomized, double-blind (when possible), controlled trials (RCTs) have long been the method of choice for testing the efficacy of a medical treatment [2], and an essential step of the procedure is the derivation of statistical inference from the data obtained. Accordingly, both in individual trials and in meta-analyses regarding CAM, statistical significance is commonly considered reliable evidence of efficacy [3,4].

Unfortunately, the type of inferential statistics presently used in medicine (frequentist statistics) has some intrinsic flaws to which CAM interventions appear to be particularly vulnerable. As statisticians periodically remind us, the p-value, which is the measure of significance expressed by frequentist statistics, is an estimate whose implication is essentially abstract. It represents the probability of obtaining, when repeating the same experiment in the absence of any difference between the parameters compared (null hypothesis or H0), the same or a more extreme result than that observed, i.e. approximately Pr(data|H0), where the symbol “|” stands for “given”. In this way, the calculation of the p-value is based not only on the obtained results but also on fictive, never observed, data. Thus the p-value provides only indirect evidence for the tested hypothesis.

One common error is interpreting the p-value as the probability of H0 given the data, i.e. Pr(H0|data). For example, in a clinical trial comparing a treated group with a control group, a p-value of 0.01 is not a 0.01 probability of H0 being true but, as mentioned, the probability of obtaining the same or a more extreme result when repeating the experiment given that H0 is true. Inferring the former from the latter is a logical fallacy called the “fallacy of the transposed conditional” [5]. One medical example of this error is believing that the probability of red spots on the skin in measles is the same as that of measles in case of red spots on the skin (the former is obviously higher). This mistake of interpretation is not uncommon in experimental science. One example in physics is the controversy which arose in 2012 when specialized media explained the discovery of the Higgs boson (a fundamental subatomic particle) solely on the basis of the extremely low p-value calculated from the experimental results [6]. Some statisticians have found this adherence to form excessive, especially in view of self-evident results. One statistician even called this rigorism “p-value police” [7] while noting that people naturally interpret p-values as posterior probabilities. However, the inconsistency remains. As J Cohen [8] wittily observed, this method of inference “does not tell us what we want to know, and we so much want to know that, out of desperation, we nevertheless believe that it does!”.

One may wonder why an inferential method providing only unsatisfactory, indirect evidence is still widely used in medical research. It is of interest to note that clinicians, in their work, use a different line of thought derived from Bayes' theorem, which provides direct evidence in terms of posterior probability.
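The difference between a conditional probability and its transposed form can be made concrete with a small numerical sketch. The counts below are invented purely for illustration (they are not epidemiological data) and the variable names are hypothetical; the only point is that Pr(red spots|measles) and Pr(measles|red spots) are computed over different denominators and can therefore differ widely.

```python
# Illustrative sketch of the "transposed conditional".
# All counts are hypothetical and chosen only to make the arithmetic visible.

measles_cases = 100          # assumed number of people with measles
measles_with_spots = 95      # assumed: most measles cases show red spots
others_with_spots = 4_905    # assumed: people with red spots from other causes


def conditional_probability(joint_count: int, condition_count: int) -> float:
    """Return Pr(A | B) = count(A and B) / count(B)."""
    if condition_count <= 0:
        raise ValueError("condition_count must be positive")
    return joint_count / condition_count


all_with_spots = measles_with_spots + others_with_spots

# Pr(red spots | measles): the denominator is everyone with measles.
pr_spots_given_measles = conditional_probability(measles_with_spots, measles_cases)

# Pr(measles | red spots): the denominator is everyone with red spots.
pr_measles_given_spots = conditional_probability(measles_with_spots, all_with_spots)

print(f"Pr(red spots | measles) = {pr_spots_given_measles:.3f}")
print(f"Pr(measles | red spots) = {pr_measles_given_spots:.3f}")
```

With these assumed counts the first probability is 0.95 and the second 0.019. Reading a p-value, which approximates Pr(data|H0), as if it were Pr(H0|data) is a transposition of the same kind.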
In general, this theorem says that the probability of an event A, given the event B, or posterior

http://dx.doi.org/10.1016/j.ejim.2014.05.014 0953-6205/© 2014 European Federation of Internal Medicine. Published by Elsevier B.V. All rights reserved.


M. Pandolﬁ, G. Carreras / European Journal of Internal Medicine 25 (2014) 607–609

probability, is equal to the probability of B given A multiplied by the prior probability of A and divided by the prior probability of B, i.e. Pr(A|B) = Pr(B|A) ∙ Pr(A)/Pr(B) (example of application below). Bayes' theorem is fundamental in the interpretation of diagnostic tests, whose results depend on the sensitivity and the specificity of the method used and on the prevalence of the condition for whose detection or exclusion the test is made. In these circumstances interest does not lie in knowing the indirect evidence provided by an abstract p-value but in the concrete probability of the presence or absence of a disease, an answer obtainable by applying Bayes' theorem as below:

Pr(disease|test positive) = Pr(test positive|disease) ∙ Pr(disease)/Pr(test positive).

Routine clinical work offers valuable help for a better understanding of the problems connected with clinical research, since the procedures followed for medical tests and those for clinical research are remarkably similar. Both are undertaken to prove a hypothesis (verification of a disease or of a research conjecture) and proceed in similar ways. In a paper dated 1987, Browner and Newman [9] described these analogies, which are reported (with some changes) in Table 1. Analogies 1 and 2 are easy to understand and do not require explanation, and the same holds for analogies 3 and 4. Both the sensitivity of a diagnostic test and the power of the method express the ability to detect a true positive, i.e. the positivity of a test in case of disease and the correct acceptance of the research hypothesis, respectively. Conversely, both 1 − specificity and the p-value concern possible mistaken values, i.e. the possibility that a test result is (falsely) positive in a subject without the disease and the probability of committing the error of a false positive research finding, respectively.
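The diagnostic form of Bayes' theorem given above can be sketched in a few lines. This is a minimal illustration, not a clinical tool: the function name is hypothetical, the sensitivity, specificity and prevalence values are assumed, and Pr(test positive) is expanded by the law of total probability into true positives plus false positives.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """Return Pr(disease | test positive) by Bayes' theorem.

    Pr(test positive) is expanded as
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence).
    """
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")

    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    pr_test_positive = true_positives + false_positives
    if pr_test_positive == 0.0:
        raise ValueError("the test is never positive under these inputs")
    return true_positives / pr_test_positive


# Assumed test characteristics, for illustration only.
SENSITIVITY = 0.90
SPECIFICITY = 0.95

for prevalence in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence = {prevalence:.2f} -> Pr(disease | test positive) = {ppv:.3f}")
```

Under these assumed values the same positive test result corresponds to a probability of disease of about 0.15, 0.67 and 0.95 as the prevalence rises from 1% to 50%, which shows how strongly the answer depends on the prior probability.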
As for analogy 5, the positive predictive value of a diagnostic test indicates the proportion of individuals actually having a disease among those in whom the test indicates its presence, while the positive predictive value of a study is the probability that, given a positive result, the research hypothesis is actually true. The reverse line of reasoning applies to the predictive value of a negative test result or a negative study. As regards analogy 6, the prior probability of disease is the prevalence of the disease (adjusted to the individual patient), while its counterpart is an estimate of the probability that the hypothesis examined is true before the study result is known.

The comparison above reveals the main difficulty in the interpretation of research results and may explain why an indirect method of finding evidence such as frequentist statistics is generally used instead of methods providing direct evidence. The reason resides in the defective correspondence of analogy 6 and, consequently, of analogy 5. While we know the prevalence of the disease to be diagnosed, we do not know how probable it is that the research hypothesis is true, and therefore we have no definite value to feed into the equation. Unfortunately, there are no methods capable of giving a quantitative or semiquantitative evaluation of the prior probability of a research hypothesis. However, the literature contains several reasonable criteria for discerning scientifically plausible hypotheses (which therefore have a higher prior probability of being true) from less credible ones. Possibly the most important criterion of credibility is conformity to science. In this respect CAM interventions, such as those proposed by homeopathy, acupuncture, iridology, Bach Flower Remedies, Ayurveda, anthroposophy, etc., have a very low prior probability of specific efficacy because their asserted modes of action imply a violation of basic science, anatomy and physiology (besides the rules of common sense).
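The predictive value of a positive study described in analogy 5 can be sketched in odds form. This is only an illustration of the analogy: it assumes that the power plays the role of sensitivity and that the significance threshold (called alpha here) plays the role of 1 − specificity. The function names, the power of 0.80, the alpha of 0.05 and the prior probabilities are all assumptions, since, as noted above, no definite prior is available for a research hypothesis.

```python
def probability_to_odds(probability: float) -> float:
    """Convert a probability in (0, 1) to odds in favor."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return probability / (1.0 - probability)


def odds_to_probability(odds: float) -> float:
    """Convert odds in favor back to a probability."""
    return odds / (1.0 + odds)


def predictive_value_of_positive_study(prior_probability: float,
                                       power: float,
                                       alpha: float) -> float:
    """Return Pr(research hypothesis true | significant result).

    Assumes power acts as sensitivity and alpha as 1 - specificity, so a
    significant result multiplies the prior odds by power / alpha.
    """
    if not 0.0 < alpha <= 1.0 or not 0.0 <= power <= 1.0:
        raise ValueError("alpha must be in (0, 1] and power in [0, 1]")
    likelihood_ratio = power / alpha
    posterior_odds = probability_to_odds(prior_probability) * likelihood_ratio
    return odds_to_probability(posterior_odds)


# Assumed study characteristics and priors, for illustration only.
for prior in (0.50, 0.10, 0.01):
    value = predictive_value_of_positive_study(prior, power=0.80, alpha=0.05)
    print(f"prior = {prior:.2f} -> predictive value of a positive study = {value:.3f}")
```

With these assumed inputs a hypothesis given even odds beforehand is very probably true after a significant result (about 0.94), whereas one given a prior probability of 0.01 remains improbable (about 0.14) despite the same result.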
Another important requisite of validity is “falsifiability”, according to which science differs from pseudoscience in that it aims at the production of falsifiable hypotheses [10]. For example, acupuncture is not falsifiable since there is currently no satisfactory simulated procedure to serve as a control for blinded clinical studies. Conformity to this requisite is important also from a pragmatic viewpoint since it makes possible the indispensable verification process of independent confirmation. One interesting criterion is “Ockham's razor”, according to which, if two models describe the observations equally well, we should choose the

simpler one (e.g. a placebo effect in CAM interventions). Interestingly, Jefferys and Berger [11] mathematically demonstrated that simpler hypotheses have a higher posterior probability of being correct than complex ones.

Although perhaps too demanding to serve as a standard significance test, Bayes' theorem can help us to translate the p-value into a posterior probability and ascertain what direct support the p-value gives to the hypothesis tested. For this purpose we can use a variant of Bayes' equation expressed in odds form as below:

Posterior odds of H0 = Prior odds of H0 × Bayes factor

where the Bayes factor (Bf) is the ratio between the two likelihoods Pr(data|H0) and Pr(data|Ha), i.e. the probability of the data given the null hypothesis H0 divided by the probability of the data given the alternative hypothesis Ha. In essence, Bf is a quotient indicating how far apart the odds we put on H0 before initiating the investigation (prior odds) are from the odds after seeing the data (posterior odds) [12]. As the quotient is formulated, the smaller Bf is, the smaller the support for H0. In this respect Bf is comparable to the p-value, whose smaller values provide stronger evidence against the null hypothesis. The “minimum Bf” [12] is the smallest amount of evidence that can be claimed for H0 (or the strongest evidence against it) and, when statistical tests are based on a Gaussian approximation, it is calculated from the same information that goes into the p-value and is therefore independent of the prior probability of H0. The minimum Bf is equal to exp(−0.5z^2), where “exp” denotes the exponential function (the base of natural logarithms raised to the given power) and “z” is the deviation from the mean in standard errors. A recent manual of epidemiology [13] contains an illuminating table reporting the relations between p-values, minimum Bf, and posterior probabilities of the null hypothesis, assuming “neutral” prior odds of H0 of 1:1 and “moderately skeptical” prior odds of 9:1.
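The conversion from a p-value to a posterior probability of H0 described above can be sketched as follows. This is a minimal illustration under stated assumptions: the p-value is taken to be two-sided, the Gaussian approximation is assumed to hold, and the function names are hypothetical. It is not a reproduction of the table in [13].

```python
import math
from statistics import NormalDist


def z_from_two_sided_p(p_value: float) -> float:
    """Return the standard normal deviate z for a two-sided p-value (assumed)."""
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - p_value / 2.0)


def minimum_bayes_factor(z: float) -> float:
    """Return the minimum Bayes factor exp(-0.5 * z^2)."""
    return math.exp(-0.5 * z * z)


def posterior_probability_of_h0(prior_odds_h0: float, bayes_factor: float) -> float:
    """Apply: posterior odds of H0 = prior odds of H0 * Bayes factor."""
    posterior_odds = prior_odds_h0 * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


for p in (0.05, 0.01, 0.001):
    bf = minimum_bayes_factor(z_from_two_sided_p(p))
    neutral = posterior_probability_of_h0(1.0, bf)      # prior odds of H0 = 1:1
    skeptical = posterior_probability_of_h0(9.0, bf)    # prior odds of H0 = 9:1
    print(f"p = {p:<5} min Bf = {bf:.4f}  "
          f"Pr(H0|data): 1:1 -> {neutral:.3f}, 9:1 -> {skeptical:.3f}")
```

Under these assumptions the sketch gives roughly 0.13 (prior odds 1:1) and 0.57 (prior odds 9:1) for p = 0.05, and roughly 0.035 for p = 0.01 at prior odds of 1:1. Other entries may differ somewhat from a published table, plausibly because such tables round the minimum Bf before converting it.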
The same manual [13] also explains in detail how to calculate the minimum Bf. The table clearly shows how substantially the inferential support provided by the Bayesian procedure is reduced in comparison with that suggested by the p-value. For example, assuming prior odds for H0 of 1:1, a p-value of 0.05, i.e. the customary threshold of significance, corresponds to a posterior probability of the null hypothesis of 0.13, which is itself non-significant. The table also shows how inopportune it is to interpret p-values at face value, especially when they are calculated from results obtained testing hypotheses with low prior probability. Thus, a “significant” p-value of 0.05 corresponds to a posterior probability of the null hypothesis of 0.57 (slightly in favor of it!) if we test a dubious research hypothesis to whose H0, being “moderately skeptical”, we assign prior odds of 9:1. Even for a p-value as low as 0.001, the posterior probability of the null hypothesis is only just significant (0.043) if the assumed prior odds for H0 are 9:1. Accordingly, the p-value of 0.01 obtained in a study on moxibustion in obstetrics (stimulation with hot mugwort of a foot acupuncture point, aimed at correcting a breech presentation [14]) would leave the null hypothesis with a posterior probability still as high as 0.26, the prior probability assumed for such a bizarre intervention being quite low (prior odds for H0 of 9:1).

The situation regarding the use of acupuncture in pain is only seemingly different. Here an effect is possible since the intervention may have a plausible mechanism of action (secretion of endogenous opioids). Thus, having a “neutral” attitude and assuming prior odds for H0 of 1:1, p-values of 0.001 (as reported in some studies [15]) would give a significant posterior probability of the null hypothesis (0.005). However, such direct statistical evidence still would not support the unlikely hypothesis tested (beneficial effect following readjustment of the balance of imaginary vital fluids) but rather the much more credible assumption of an unspecific placebo effect.
As E Ernst [16] observed, acupuncture, being exotic, invasive, slightly painful, and involving touch and direct contact with a therapist, carries most of the features capable of eliciting a placebo effect. Therefore, when testing acupuncture and other forms of CAM, more meaningful statistical significance would be obtained if the hypothesis on trial were not, as is customary in these studies, the specific action of the intervention but its capacity to elicit a placebo response.


Table 1. Analogy between diagnostic tests and clinical research studies.

No.  Diagnostic test                                       Clinical trial
1    Absence of disease                                    H0 is not rejected (research hypothesis is not accepted)
2    Presence of disease                                   Rejection of H0 (acceptance of research hypothesis)
3    Sensitivity                                           Power of the method
4    1 − Specificity                                       p-Value
5    Predictive value of a positive (or negative) result   Predictive value of a positive (or negative) study
6    Prior probability of disease                          Prior probability of research hypothesis

Thus, in clinical studies, simply obtaining statistical significance according to the widely used p-value is far from adequate. It is equally important to be aware of the intrinsic limitations of the inferential method used and of the sources of error connected with its application. Seeing the considerable number of false positive research results (although sometimes overstated [17]), one may conclude that clinical researchers are often oblivious of these rules of caution. The use of frequentist inferential statistics is popular among researchers also because it is considered “objective” and is easily calculated. For the time being, it seems unlikely that this method will be abandoned in favor of the more informative but demanding Bayesian inference. In the meantime it might be possible to avoid many false positive results by increasing the power of the method and adopting a more stringent level of significance, in order to counteract the tendency of the p-value to overstate the evidence against the null hypothesis. More importantly, the hypothesis tested should be consistent with sound logic and science and therefore have a reasonable prior probability of being correct. As a rule of thumb, assuming a “neutral” attitude towards the null hypothesis (prior odds = 1:1), a p-value of 0.01 or, better, 0.001 should suffice to give a satisfactory posterior probability of the null hypothesis of 0.035 and 0.005 respectively.

Precautions of this kind appear to be ineffective as regards complementary alternative medicines because of the inherently low prior probability of the hypotheses tested in studies of these interventions. In this field, occasional positive results obtained in the absence of bias should properly be attributed to causes more credible (most often a placebo effect) than the hypothesis examined, which generally is the specific efficacy of the intervention.
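The rule of thumb given in this section can also be read in reverse: for given prior odds of H0, how small must the p-value be for the posterior probability of H0 to fall to a chosen level? The sketch below inverts the minimum Bf relation exp(−0.5z^2) for that purpose. It assumes a two-sided p-value and the Gaussian approximation; the function name and the target posterior probability of 0.05 are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def required_two_sided_p(prior_odds_h0: float, target_posterior_h0: float) -> float:
    """Return the largest two-sided p-value whose minimum Bayes factor
    brings Pr(H0 | data) down to the target, for the given prior odds of H0.
    """
    if prior_odds_h0 <= 0.0:
        raise ValueError("prior_odds_h0 must be positive")
    if not 0.0 < target_posterior_h0 < 1.0:
        raise ValueError("target_posterior_h0 must lie strictly between 0 and 1")

    target_odds = target_posterior_h0 / (1.0 - target_posterior_h0)
    required_bf = target_odds / prior_odds_h0
    if required_bf >= 1.0:
        return 1.0  # the prior odds alone already meet the target

    # Invert min Bf = exp(-0.5 * z^2) to get z, then convert z to a p-value.
    z = math.sqrt(-2.0 * math.log(required_bf))
    return 2.0 * (1.0 - NormalDist().cdf(z))


TARGET = 0.05  # assumed target for Pr(H0 | data), for illustration only

for prior_odds in (1.0, 9.0, 99.0):
    p_needed = required_two_sided_p(prior_odds, TARGET)
    print(f"prior odds of H0 = {prior_odds:g}:1 -> p-value must not exceed {p_needed:.6f}")
```

Under these assumptions a neutral prior requires a p-value of about 0.015, while prior odds of 9:1 require about 0.0013. Because the minimum Bf is the strongest evidence against H0 that the data can provide, these thresholds are necessary rather than sufficient, and they shrink rapidly as the prior odds of H0 grow.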
Since the achievement of meaningful statistical significance is, as a rule, an essential step in the validation of medical interventions, we have to conclude that, unless some authentic scientific support for complementary alternative medicines is provided in the meantime, these practices cannot be considered evidence-based.

Learning points

• It is often forgotten that frequentist statistics, commonly used in clinical trials, provides only indirect evidence in support of the hypothesis examined.
• The p-value inherently tends to exaggerate the support for the hypothesis tested, especially if the scientific plausibility of the hypothesis is low.
• When the rationale for a clinical intervention is disconnected from the basic principles of science, as in the case of complementary alternative medicines, any positive result obtained in clinical studies is more reasonably ascribable to hypotheses (generally to a placebo effect) other than the hypothesis on trial, which commonly is the specific efficacy of the intervention.
• Since meaningful statistical significance is, as a rule, an essential step in the validation of a medical intervention, complementary alternative medicine cannot be considered evidence-based.

Conflict of interests

We declare that we have no conflict of interest.

References

[1] Harris PE, Cooper KL, Relton C, Thomas KJ. Prevalence of complementary and alternative medicine (CAM) use by the general population: a systematic review and update. Int J Clin Pract 2012;66:924–39.
[2] Glantz S. Primer of biostatistics. McGraw-Hill; 1997 53.
[3] Bellavite P, Fisher P. Homeopathy: Where is the bias? Eur J Int Med 2013;274:612–3.
[4] Linde K, Allais G, Brinkhaus B, Manheimer E, Vickers A, et al. Acupuncture for migraine prophylaxis. Cochrane Database Syst Rev 2009(1) CD001218.
[5] Wagenmakers EJ, Wetzels R, Borsboom D, van der Maas H. Why psychologists must change the way they analyze their data: The case of Psi. J Personal Soc Psychol 2011;100:426–32.
[6] http://for-sci-law-now.blogspot.it/2012/07/probability-that-higgs-boson-has-been.html.
[7] http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/.
[8] Cohen J. The Earth is round (p < .05). Am Psychol 1994;49:997–1003.
[9] Browner WS, Newman T. Are all significant P values created equal? JAMA 1987;257:2459–63.
[10] Bortolotti L. The Philosophy of Science. Politi 2008:14.
[11] http://quasar.as.utexas.edu/papers/ockham.pdf.
[12] Goodman S. Towards evidence-based medical statistics. 2: The Bayes factor. Ann Intern Med 1999;130:1005–13.
[13] Gerstman B. Epidemiology kept simple: An introduction to traditional and modern epidemiology. Wiley; 2013 213.
[14] Neri I, Airola G, Contu G, Allais G, et al. Acupuncture plus moxibustion to resolve breech presentation: a randomized controlled study. J Matern Fetal Neonatal Med 2004;15:247–52.
[15] Melchart D, Streng A, Hoppe A, Brinkhaus B, et al. Acupuncture for tension-type headache: randomised controlled trial. BMJ 2005;331:376–82.
[16] http://edzardernst.com/2013/02/acupuncture-placebo/.
[17] Goodman S, Greenland S. Why most published research findings are false: problems in the analysis. PLoS 2007;4:e168.