Scandinavian Journal of Clinical and Laboratory Investigation Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/iclb20

Validating precision – how many measurements do we need?

Arne Åsberg, Kristine Bodal Solem & Gustav Mikkelsen

Department of Clinical Chemistry, Trondheim University Hospital, Trondheim, Norway

Published online: 17 Jun 2015.

To cite this article: Arne Åsberg, Kristine Bodal Solem & Gustav Mikkelsen (2015) Validating precision – how many measurements do we need?, Scandinavian Journal of Clinical and Laboratory Investigation, 75:6, 496-499, DOI: 10.3109/00365513.2015.1055583

To link to this article: http://dx.doi.org/10.3109/00365513.2015.1055583


Scandinavian Journal of Clinical & Laboratory Investigation, 2015; 75: 496–499

ORIGINAL ARTICLE

Validating precision – how many measurements do we need?

ARNE ÅSBERG, KRISTINE BODAL SOLEM & GUSTAV MIKKELSEN

Downloaded by [New York University] at 03:48 02 September 2015

Department of Clinical Chemistry, Trondheim University Hospital, Trondheim, Norway

Abstract

Background: A quantitative analytical method should be sufficiently precise, i.e. the imprecision measured as a standard deviation should be less than the numerical definition of the acceptable standard deviation. We propose that the entire 90% confidence interval for the true standard deviation shall lie below the numerical definition of the acceptable standard deviation in order to assure that the analytical method is sufficiently precise. We also present power function curves to ease the decision on the number of measurements to make.

Methods: Computer simulation was used to calculate the probability that the upper limit of the 90% confidence interval for the true standard deviation was equal to or exceeded the acceptable standard deviation. Power function curves were constructed for different scenarios.

Results: The probability of failure to assure that the method is sufficiently precise increases with decreasing number of measurements and with increasing standard deviation when the true standard deviation is well below the acceptable standard deviation. For instance, the probability of failure is 42% for a precision experiment of 40 repeated measurements in one analytical run and 7% for 100 repeated measurements, when the true standard deviation is 80% of the acceptable standard deviation. Compared to the CLSI guidelines, validating precision according to the proposed principle is more reliable, but demands considerably more measurements.

Conclusions: Using power function curves may help when planning studies to validate precision.

Key Words: Chemistry techniques, analytical/methods, quality control, precision, imprecision, sample size

Correspondence: Arne Åsberg, Department of Clinical Chemistry, Trondheim University Hospital, N-7006 Trondheim, Norway. E-mail: arne.aasberg@stolav.no

(Received 8 September 2014; accepted 14 May 2015)

ISSN 0036-5513 print/ISSN 1502-7686 online © 2015 Informa Healthcare. DOI: 10.3109/00365513.2015.1055583

Introduction

When evaluating an analytical method, the laboratory has to define certain performance criteria. Among them is analytical precision, which is related to the random error of measurements and expressed as imprecision. Analytical imprecision may be decomposed into several components [1]. In its simplest form, the analytical imprecision is estimated as the standard deviation (s) of the analytical results in a run (batch) of repeated measurements.

Suppose that the manufacturer claims the standard deviation for repeated measurements with a certain method to be less than 2 mmol/L at a certain concentration and we want to assure that the standard deviation really is less than 2 mmol/L. How many repeated measurements do we need to undertake? It depends on two conditions. First, it depends on the true standard deviation (σ) of the method, which we do not know for certain. Intuitively, it should be simpler to assure that σ is less than 2.0 mmol/L if it in fact is 1.0 mmol/L rather than 1.9 mmol/L. Second, it depends on the confidence we want to have that σ is less than 2.0 mmol/L.

According to CLSI guideline EP15-A3 the user has demonstrated precision consistent with the manufacturer’s claim if s is less than or equal to the claim or if σ is not statistically significantly larger than the claim [1]. In this case, we should be testing the null hypothesis that σ equals 2.0 mmol/L if s is found to be greater than 2.0 mmol/L. However, testing that hypothesis after making a few repeated measurements may lead us to accept the null hypothesis that σ is equal to 2.0 mmol/L, because the confidence interval for σ will be very wide and may include 2.0 mmol/L even if σ is far from that value (less or greater). Such a procedure encourages performing small experiments with imprecise estimates of σ. Instead, we propose to make enough measurements to obtain a sufficiently narrow confidence interval for σ. When σ really is less than 2.0 mmol/L, we want the entire confidence interval for σ, for instance its 90% confidence interval, to lie below 2.0 mmol/L in order to assure that σ is less than 2.0 mmol/L, analogous to the way of thinking when assessing mean equivalence [2].

In planning the experiment we have to guess on the magnitude of σ; say it is 1.6 mmol/L. Then we could calculate the necessary number of measurements to make the entire 90% confidence interval for σ to lie below 2.0 mmol/L. However, if σ is exactly 1.6 mmol/L, s would exceed σ in 50% of such experiments and consequently the upper limit of the confidence interval would exceed 2.0 mmol/L, indicating that σ might not be less than 2.0 mmol/L. A 50% probability of failure may be too high a risk to take, so we need to make more measurements – but how many more? If we consider the situation where the upper limit of the 90% confidence interval for σ exceeds the numerical value of the acceptable standard deviation as a failure to assure acceptable imprecision, we need power function curves to tell us the probability of failure as a function of σ and the number of measurements (actually the degrees of freedom, see below). To our knowledge, such power function curves are not available, so we have constructed relevant curves and present them in this paper.


Methods

We used computer simulation to calculate the probability that the upper limit of the 90% confidence interval for σ would be equal to or exceed the numerical definition of the acceptable standard deviation. For each selected σ, given as a fraction of the acceptable standard deviation, the computer program drew a specified number (n) of random figures from a Gaussian (Normal) distribution. Then the program calculated the sample s and the upper limit of the 90% confidence interval for σ as $[(\nu \cdot s^{2})/L]^{0.5}$, where ν is the degrees of freedom (n − 1 in these cases), s² is the sample variance, and L is the lower tail 5 percentage point of the χ² distribution with ν degrees of freedom [3]. If the upper limit of the 90% confidence interval was equal to or exceeded the numerical definition of the acceptable standard deviation, an ‘alarm’ was registered. As σ was always below the acceptable standard deviation, each ‘alarm’ was a false ‘alarm’ of unacceptable standard deviation. This procedure was repeated 1 million times for each combination of n and σ. The probability of failure to assure acceptable imprecision was calculated as the number of ‘alarms’ divided by 1 million.

For selected numbers of degrees of freedom from 9 to 199 we constructed power function curves as the probability of failure to assure acceptable imprecision plotted against the true standard deviation given as a fraction of the acceptable standard deviation. To make reasonably smooth curves, we determined 21 equidistant points on each power function curve. We used the software Statistics101 version 2.8 (http://www.statistics101.net) for the simulations and Stata version 13 (http://www.stata.com) to draw the curves as straight lines through the points.

Results

Figure 1 shows the probability of failure to assure acceptable imprecision as a function of σ, where σ is given as a fraction of the acceptable standard deviation. The probability of failure increases with increasing σ and with decreasing degrees of freedom (decreasing number of measurements). The curves representing different degrees of freedom converge at a probability of 0.95 when σ is equal to the acceptable standard deviation. For instance, when σ is 0.8 times (80% of) the acceptable standard deviation, the probability of alarm is 0.42 (42%) for an experiment with 39 degrees of freedom and 0.07 (7%) for an experiment with 99 degrees of freedom, corresponding to 40 and 100 measurements, respectively, for simple experiments of repeated measurements in one analytical run.

Discussion
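The simulation procedure described under Methods can be illustrated with a short Python sketch. It is not the authors' Statistics101 program: the function name `simulate_failure_probability`, the fixed seed and the reduced number of repetitions (20,000 rather than 1 million) are assumptions made here to keep the example small and quick. The χ² point L = 22.465 for 35 degrees of freedom is the value used in the paper's worked example with 36 measurements, and the acceptable standard deviation is set to 1 so that σ is expressed directly as a fraction of it.

```python
import math
import random


def simulate_failure_probability(n, ratio, lower_chi2_point, reps=20000, seed=1):
    """Estimate P(upper limit of the 90% CI for sigma >= acceptable SD).

    n                : number of repeated measurements in one run (nu = n - 1)
    ratio            : true sigma as a fraction of the acceptable SD
    lower_chi2_point : lower-tail 5% point L of chi-square with nu df
    reps, seed       : illustrative choices, not taken from the paper
    """
    rng = random.Random(seed)
    nu = n - 1
    acceptable_sd = 1.0              # work in units of the acceptable SD
    sigma = ratio * acceptable_sd
    alarms = 0
    for _ in range(reps):
        # Draw n Gaussian values and compute the sample variance s^2
        values = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(values) / n
        s2 = sum((v - mean) ** 2 for v in values) / nu
        # Upper limit of the 90% confidence interval: [(nu * s^2) / L]^0.5
        upper = math.sqrt(nu * s2 / lower_chi2_point)
        if upper >= acceptable_sd:
            alarms += 1              # a (false) 'alarm' is registered
    return alarms / reps


if __name__ == "__main__":
    # 36 measurements, sigma at 80% of the acceptable SD, L for 35 df
    p = simulate_failure_probability(n=36, ratio=0.8, lower_chi2_point=22.465)
    print(f"estimated probability of failure: {p:.3f}")
```

With these inputs the estimate should fall somewhat below 0.5, in line with the paper's statement that the probability of failure is near to 50% when 36 measurements are planned. Other combinations of n and σ need the matching χ² point for ν = n − 1.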

The power function curves in Figure 1 are constructed to show the probability of failure when we try to assure that σ is less than the acceptable standard deviation. We chose to plot failure probability curves rather than success probability curves, because the decision on the number of measurements is a trade-off between a low probability of failure and a high cost of doing many measurements. We believe workers in laboratory medicine are well trained to understand these curves, as the power function curves of quality control rules have the same form [4]. We constructed the curves as a function of the ratio between σ and the acceptable standard deviation, in order to avoid making multiple figures. The curves are constructed for certain degrees of freedom, so the user has to calculate the appropriate number of measurements for each experimental design. For simple repeatability experiments in one analytical run the degrees of freedom are equal to the number of measurements minus 1, i.e. the number of measurements is the degrees of freedom plus 1. For repeatability experiments over several runs with one pair of duplicate measurements from each run, the degrees of freedom are equal to the number of pairs. The curves are drawn for the ratio interval 0.5–1.0, as the interval 0.0–0.5 is less interesting.

[Figure 1: line plot of probability (y-axis, 0.0–1.0) against σ as a fraction of acceptable standard deviation (x-axis, 0.5–1.0), with one curve for each of 9, 14, 19, 29, 39, 59, 79, 99 and 199 degrees of freedom.]

Figure 1. The probability of failing to assure acceptable imprecision plotted as functions of the true analytical standard deviation (σ), which is given as a fraction of the acceptable standard deviation. The various curves represent different degrees of freedom. For simple repeatability experiments in one analytical run the number of measurements is equal to the degrees of freedom plus 1. See Methods for further explanations.

We used computer simulation to avoid the complex calculations based on non-central distributions. Each curve is based on 21 points and each point on 1 million computer simulations, so the curves are precisely drawn. They converge at a probability of 0.95, as expected for a one-sided situation with a 90% confidence interval. The formulas for calculating the limits of the confidence interval for σ depend on the assumption of Gaussian distribution of the measurements, which of course is fulfilled for the computer simulations but not necessarily so for real measurements. However, in most circumstances we assume that repeated measurements are Gaussian distributed.

Formally, the curves apply only to σ, not the coefficient of variation (CV), i.e. σ as a percentage of the mean analytical result [5]. However, the curves are valid for the ratio of true to acceptable CV if n is greater than 25 or if CV is less than 10%, because then the confidence interval for CV can be precisely calculated using a chi square distribution in the same way as for the standard deviation [5]. In any case, if the acceptable imprecision is given as a CV, we can calculate the corresponding acceptable standard deviation at a given concentration.

Now, let us return to the example of assuring that σ is less than 2.0 mmol/L when it in fact is 1.6 mmol/L.
We will accept that σ is less than 2.0 mmol/L if the entire 90% confidence interval for σ is below 2.0 mmol/L, because then we are 95% sure that σ is less than 2.0 mmol/L [6]. If the entire 90% confidence interval is above 2.0 mmol/L we will accept that σ is larger than 2.0 mmol/L, and if the 90% confidence interval includes 2.0 mmol/L the uncertainty will persist. We could determine the necessary number of repeated measurements to make the entire 90% confidence interval lie below 2.0 mmol/L [3]. If s was found to be 1.6 mmol/L based on 36 repeated measurements, the upper limit of the 90% confidence interval for σ would be $\{[(36 - 1) \cdot 1.6^{2}]/22.465\}^{0.5}$ mmol/L = 1.997 mmol/L, and $\{[(35 - 1) \cdot 1.6^{2}]/21.664\}^{0.5}$ mmol/L = 2.004 mmol/L if based on 35 measurements, so doing 36 measurements would seem to be safe. However, if σ is 1.6 mmol/L, we expect to get s higher than 1.6 mmol/L in 50% of the experiments. Consequently, the probability of failure to assure acceptable imprecision will be near to 50% if we plan to do 36 measurements. Using Figure 1 with the fraction 1.6 mmol/L / 2.0 mmol/L = 0.8 for σ, we find that the probability of failure is a little above 40% (in fact 42%) if we do 40 measurements (degrees of freedom equal to 39), and even if we do 100 measurements (degrees of freedom equal to 99) the probability of failure is more than 5% (in fact it is 7%). So, in this case we might plan to do at least 70 repeated measurements instead of 36 to reduce the probability of failure from near to 50% to 20% or less. Of course, we could start with 36 measurements and hope for the best. Then we would have to do more measurements if the 90% confidence interval for σ included 2.0 mmol/L. Anyway, using Figure 1 allows for a more realistic planning at this single guess of σ. It also gives an impression of how the probability of failure changes with the value of σ. Trying to assure that σ is less than the acceptable standard deviation when in fact it is 95% of the acceptable standard deviation may be a formidable task, requiring hundreds of measurements. On the other hand, if σ is 50% of the acceptable standard deviation, we need only 10–15 measurements.

This way of thinking is very different from that of the CLSI guideline EP15-A3, where the authors state that a precision claim is accepted if s is less than or equal to the claim or if σ is not statistically significantly greater than the claim [1]. If we take the precision claim in that guideline to represent the acceptable standard deviation in this paper, and use the CLSI guideline to verify the precision, the probability of ‘false rejection’ is 5% if σ is equal to the acceptable standard deviation (claim) [1]. In contrast, for the recommended design of five replicates per run with one run per day for five days [1], we would state that the probability of failure to assure acceptable imprecision is, for instance, about 60% if σ is 20% less than the claim (Figure 1), as the degrees of freedom are equal to $5 \cdot 1 \cdot (5 - 1) = 20$ in this case.
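The arithmetic in this example can be checked with a short Python sketch. It is not the authors' simulation. It assumes Gaussian measurements, so that ν·s²/σ² follows a χ² distribution with ν degrees of freedom and the probability of failure can be written as P(χ²_ν ≥ L/r²), where r is σ divided by the acceptable standard deviation. All function names (`gamma_p`, `chi2_cdf`, `chi2_ppf`, `upper_limit`, `failure_probability`, `smallest_df`) are hypothetical helpers introduced for illustration. The incomplete gamma routine is a textbook series and continued-fraction evaluation, written out so that only the standard library is needed.

```python
import math


def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), for a > 0."""
    if x <= 0.0:
        return 0.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion, which converges quickly when x < a + 1
        term = total = 1.0 / a
        k = a
        for _ in range(100000):
            k += 1.0
            term *= x / k
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return total * math.exp(log_prefactor)
    # Continued fraction for the upper tail (modified Lentz method)
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 100000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return 1.0 - math.exp(log_prefactor) * h


def chi2_cdf(x, nu):
    """Cumulative distribution function of chi-square with nu df."""
    return gamma_p(nu / 2.0, x / 2.0)


def chi2_ppf(p, nu):
    """Lower-tail percentage point of chi-square, found by bisection."""
    lo, hi = 0.0, nu + 20.0 * math.sqrt(2.0 * nu) + 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def upper_limit(s, nu, conf=0.90):
    """Upper limit of the two-sided confidence interval for sigma."""
    return math.sqrt(nu * s * s / chi2_ppf((1.0 - conf) / 2.0, nu))


def failure_probability(ratio, nu, conf=0.90):
    """P(upper limit >= acceptable SD) when sigma = ratio * acceptable SD."""
    lower_point = chi2_ppf((1.0 - conf) / 2.0, nu)
    return 1.0 - chi2_cdf(lower_point / ratio ** 2, nu)


def smallest_df(ratio, max_failure, conf=0.90, nu_max=5000):
    """Smallest df whose failure probability is at most max_failure."""
    for nu in range(2, nu_max + 1):
        if failure_probability(ratio, nu, conf) <= max_failure:
            return nu
    return None


if __name__ == "__main__":
    print(round(upper_limit(1.6, 35), 3))   # 36 measurements; text gives 1.997
    print(round(upper_limit(1.6, 34), 3))   # 35 measurements; text gives 2.004
    print(round(failure_probability(0.8, 39), 2))  # text gives 42%
    print(round(failure_probability(0.8, 99), 2))  # text gives 7%
    print(smallest_df(0.8, 0.20))  # df needed for at most 20% failure
```

The search in `smallest_df` is a planning aid and not a reproduction of Figure 1, which is drawn only for selected degrees of freedom. If SciPy is available, `scipy.stats.chi2.ppf` and `scipy.stats.chi2.sf` could replace the hand-written routines.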
The major difference between our proposal and the CLSI guideline EP15-A3 is that according to our proposal one would accept the precision claim only if σ is statistically significantly less than the claim, while according to the CLSI guideline one would accept the claim if σ is not statistically significantly greater than the claim. Therefore, one would not be very confident that σ is less than the acceptable standard deviation using the CLSI guideline EP15-A3.

In conclusion, we propose that the entire 90% confidence interval for the true standard deviation shall lie below the numerical definition of the acceptable standard deviation in order to assure that the analytical method is sufficiently precise. Using power function curves may help when planning studies to validate precision.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

References

[1] CLSI. User verification of precision and estimation of bias: Approved guideline, 3rd edn. CLSI document EP15-A3. Wayne, PA: Clinical and Laboratory Standards Institute; 2014. pp. 11–39.
[2] Åsberg A, Solem KB, Mikkelsen G. Determining sample size when assessing mean equivalence. Scand J Clin Lab Invest 2014;74:713–5.
[3] Kringle RO, Bogovich M. Statistical procedures. In: Burtis CA, Ashwood ER, eds. Tietz textbook of clinical chemistry, 3rd edn. Philadelphia: Saunders; 1999. p. 285.
[4] Klee GG, Westgard JO. Quality management. In: Burtis CA, Ashwood ER, Bruns DE, eds. Tietz textbook of clinical chemistry and molecular diagnostics, 5th edn. St. Louis: Elsevier Saunders; 2012. pp. 182–4.
[5] Gao Y, Ierapetritou MG, Muzzio FJ. Determination of the confidence interval for the relative standard deviation using convolution. J Pharm Innov 2013;8:72–82.
[6] Åsberg A, Bolann B, Mikkelsen G. Using the confidence interval for the mean to detect systematic errors in one analytical run. Scand J Clin Lab Invest 2010;70:410–4.