Preventive Veterinary Medicine 113 (2014) 298–303


The data – Sources and validation

Ulf Emanuelson*, Agneta Egenvall

Department of Clinical Sciences, Swedish University of Agricultural Sciences, POB 7054, SE-75007 Uppsala, Sweden

* Corresponding author. Tel.: +46 (0)18 67 18 26; fax: +46 (0)18 67 35 45. E-mail address: [email protected] (U. Emanuelson).

http://dx.doi.org/10.1016/j.prevetmed.2013.10.002

Article history: Received 17 February 2013; received in revised form 19 September 2013; accepted 1 October 2013.

Keywords: Database; Observational studies; Secondary data; Validation

Abstract

The basis for all observational studies is the availability of appropriate data of high quality. Data may be collected specifically for the research purpose in question (so-called “primary data”), but data collected for other purposes (so-called “secondary data”) are also sometimes used, and useful, in research. High accuracy and precision are required, irrespective of the source of the data, to arrive at correct and unbiased results efficiently. Both careful planning prior to the start of data acquisition and thorough procedures for data entry are obvious prerequisites for high-quality data. However, data should also be subjected to thorough validation after collection. Primary data are mainly validated through proper screening, using various descriptive statistical methods. Validation of secondary data is associated with specific conditions – the first of which is to be aware of the limitations on their usefulness imposed by procedures during collection. Approaches for the validation of secondary data are briefly discussed in this paper, and include patient-chart review, combination with data from other sources, two-stage sampling, and aggregate methods.

1. Introduction

Observational studies obviously rely on the availability of observations. Data based on such observations need to be of sufficient quality to avoid or minimize the risk of drawing false conclusions, i.e. falling victim to the “garbage in, garbage out” (GIGO) trap. This term was first used in 1963 within computer science (Anonymous, 2013), where computers were seen as unquestioningly processing the most nonsensical of input data (“garbage in”) and therefore producing nonsensical output (“garbage out”; Fig. 1). Nowadays, GIGO is also used in other areas and is equally applicable to, for instance, the field of analytical epidemiology – where we can also experience its counterpart, in which sound data are fed into a mis-specified model (Fig. 1; but that is a topic for other contributions to this special issue of Preventive Veterinary Medicine). For this paper, it is pertinent to realize that all data have errors – full stop! The task at hand is first and foremost to minimize data errors as much as possible, but also to identify errors so that their effects on the output can be identified and (hopefully) reduced.


This paper is a summary of a presentation given at the Schwabe Symposium in December 2012, honoring the lifetime achievements in epidemiology and preventive veterinary medicine of Dr. Ian Dohoo; we briefly review steps that can be taken to minimize errors.

2. General recommendations

General recommendations on how to avoid systematic errors (i.e. reducing bias) and random variation (i.e. increasing precision) in data to be used in observational studies (and, indeed, most other types of studies) can be grouped into pre-execution, during-execution, and post-execution actions. Most such steps should be obvious to anyone with basic training in, and understanding of, research, but they can easily be overlooked and are therefore worthwhile to identify.

Prior to executing a project, the most important issue – but one sometimes forgotten or at least unheeded – is to formulate a clear hypothesis that is parsimonious and testable.


Fig. 1. The garbage in, garbage out paradigm.

The hypothesis is a foundation for: identifying the most suitable study design; deciding what data to record and how to record them (to make it possible to test the hypothesis); and calculating an appropriate sample size (a sketch of such a calculation, under simple assumptions, closes this section). A proper hypothesis is therefore crucial for the validity of a research project. A next step in planning a project is to clearly define all observations that will be made – in terms of unit of observation, types of variables (continuous, categorical, etc.), precision of measurements, etc. – prior to the actual recording. Much can be said about how to record observations, but that is outside the scope of this paper. However, it is worth emphasizing that when data are collected through questionnaires, it is important to validate the questionnaire thoroughly prior to execution, e.g. for use with different languages (Dufour et al., 2010). A possible first step is to conduct a pre-pilot test, which can be performed on a convenience sample of subjects (or their owners) or others who might not necessarily be part of the target population (e.g. colleagues, experts in the field). A proper pilot test should then be performed on a reasonable number of subjects who are representative of the target population for the questionnaire, but who will not be included in the actual sample. An appropriate piloting process allows the investigator to identify questions that are confusing or where there would be no variation in the answers. Finally, in all research it is important to ensure that all persons involved in the gathering of data understand their role in the project and are properly trained for their task – and perhaps continuously updated to avoid drift in data recording.

Continuous monitoring of the data recorded during the execution of a study is good practice. When data handling takes place almost immediately after data have been compiled, missing values can be updated at once. Errors (or deviations, e.g. in diagnostic tests) can also be discovered in time – corrections are not easy to make if errors are discovered only after the completion of the study. Data-entry procedures should therefore also be in place during the execution phase, preferably designed to minimize the risk of transfer and typing errors. Finally, data should be validated after they have been gathered – which is the topic for the remainder of this paper.
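As an illustration of the sample-size step above, the following is a minimal sketch of a calculation for comparing two proportions. All inputs (prevalences of 30% versus 20%, 5% two-sided significance, 80% power) are hypothetical assumptions, not values from this paper.

```python
# Sample-size sketch for comparing two proportions with statsmodels.
# All inputs (30% vs 20% prevalence, 5% two-sided alpha, 80% power)
# are hypothetical assumptions.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.30, 0.20)  # Cohen's h for the two prevalences
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided")
print(f"approximately {n_per_group:.0f} subjects per group")
```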

3. Primary data

Primary data are data that have been collected with a specific research question (hypothesis) in mind. Their validation starts when the information is recorded; ideally (and as already pointed out), this should begin before or during the execution of the study. Observations might be recorded on paper, but at some point all information will be put into a computerized format. Spreadsheets are convenient for that purpose, and are therefore quite commonly used – but they must be used with caution, because it is possible to sort individual columns (and thus completely destroy the data), and their seeming credibility may hide unwanted (and unnoticed) changes to the data. It is also more difficult to trace data edits in spreadsheets. A much better option is a general-purpose database manager, of which there are several commercial alternatives (e.g. MS Access) as well as public-domain ones (e.g. OpenOffice, EpiData). Not only are database managers not prone to the same errors as spreadsheets, they also allow some error checking at data entry (e.g. by using input masks or consistency checks). It is also worthwhile to consider a relational database when data are hierarchical in nature, because entering some of the information in duplicate can then be avoided (which minimizes the risk of inconsistencies).

To reduce typing errors, data should be entered twice, with an automatic comparison between the entries; a minimal sketch of such a comparison follows below. An alternative is to proofread all or parts of the data against the original records. Data might also be scanned and parts checked manually (Murray et al., 2010). Software systems that read data from practice records have been used to structure clinical data (Lam et al., 2007); these offer potential for using clinical data directly.
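The sketch below illustrates the double-entry comparison just described. The file and column names are hypothetical, and it assumes both typists entered the same set of records.

```python
# Double data entry with automatic comparison (a sketch; file and column
# names are hypothetical). Assumes both files contain the same records,
# identified by a shared key column "animal_id".
import pandas as pd

entry1 = pd.read_csv("entry1.csv").set_index("animal_id").sort_index()
entry2 = pd.read_csv("entry2.csv").set_index("animal_id").sort_index()

# DataFrame.compare() lists every cell where the two entries disagree;
# each discrepancy should then be resolved against the original record.
discrepancies = entry1.compare(entry2)
print(f"{len(discrepancies)} records with at least one discrepancy")
print(discrepancies)
```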

Irrespective of the method of data recording, it is almost equally important to collect and record metadata with the file (i.e. data about the data). Such metadata should contain a general description of the database and of how, and by whom, the data were collected – but should also include definitions of variables (columns), units of measurement (e.g. kilograms, liters, optical density), precision of measurements, date of recording (including time-zone information), interpretation of codes, etc. Metadata might not be necessary for the immediate validity of primary data – but they are absolutely crucial for their long-term preservation and use. Recommendations on stewardship of data and other aspects of databases can be found in a report from the US National Academy of Sciences (2009).

Validation of primary data, either post-execution or during execution, is done by intelligent use of descriptive statistics. Variables measured quantitatively (on either a continuous or a discrete scale) can be evaluated by identification of possible outliers, illustrated e.g. by a boxplot (also known as a “box-and-whiskers diagram”). Variables measured qualitatively take only particular values and may be evaluated using frequency tables, to identify “illegal” (or at least non-plausible) categories or unexpected distributions. Stratified analyses should be used in the evaluation of both quantitative and qualitative variables. Further exploration of the recorded data can be done with graphical illustrations (e.g. bivariate scatter plots). Finally, logical controls can be used to identify impossible combinations of variables (e.g. pregnancy in a male) or impossible time sequences of events (e.g. dogs entering an insurance database before their birth dates). The sketch below combines these screening steps.
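A minimal screening sketch along the lines above; the file and column names (animal_id, milk_yield_kg, breed, sex, pregnant) are hypothetical assumptions.

```python
# Post-entry screening with descriptive statistics (a sketch; the file
# and column names are hypothetical assumptions).
import pandas as pd

df = pd.read_csv("herd_data.csv")

# Quantitative variable: flag values beyond the boxplot whiskers
# (1.5 x the interquartile range outside the quartiles).
q1, q3 = df["milk_yield_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["milk_yield_kg"] < q1 - 1.5 * iqr) |
              (df["milk_yield_kg"] > q3 + 1.5 * iqr)]
print(outliers[["animal_id", "milk_yield_kg"]])

# Qualitative variable: frequency table to spot "illegal" categories
# or unexpected distributions.
print(df["breed"].value_counts(dropna=False))

# Logical control: impossible combinations of variables.
print(df[(df["sex"] == "male") & (df["pregnant"] == 1)])
```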

Special attention should be given to missing values – something that is easily overlooked when “normal” descriptive statistics are used for validation. It is almost impossible to avoid missing data in epidemiological research, but it is important to identify whether observations are missing at random or not. Patterns in missingness can be identified e.g. by cross-tabulation of missing values with other variables, or within different strata (as sketched below). If data are missing non-randomly, major biases can arise in statistical analyses that include only complete cases. Missing values are not only a potential source of bias – they also lead to a substantial loss of information (especially when multivariable statistical methods are used). In some cases of missingness it might be useful to consider multiple imputation of missing values (e.g. Sterne et al., 2009; Spratt et al., 2010), but there are also alternatives based on full-information maximum-likelihood estimation (e.g. Allison, 2012).
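A sketch of the cross-tabulation of missingness just mentioned; the file and column names are hypothetical assumptions.

```python
# Cross-tabulating missingness against another variable (a sketch; the
# file and column names are hypothetical assumptions).
import pandas as pd

df = pd.read_csv("herd_data.csv")
df["yield_missing"] = df["milk_yield_kg"].isna()

# If the proportion of missing values differs markedly between herds
# (or breeds, regions, ...), the data are unlikely to be missing
# completely at random, and complete-case analysis may be biased.
print(pd.crosstab(df["herd_id"], df["yield_missing"], normalize="index"))
```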

Validation of primary data is a necessary step in any research, but care should be exercised when “errors” are identified. It is not proper to indiscriminately remove what are perceived as “outliers”, because this would be based on previous experience that might or might not be applicable in the current situation. Of course, obvious typing errors and errors introduced during transfer of data between different (computer) systems should be corrected. Corrections should always be logged and be reversible, e.g. by preserving the original data. A general rule of thumb is never to make corrections unless there is convincing proof of errors (e.g. biologically impossible values); rather, perform multivariable analyses with and without the identified “problematic observations” if proof is lacking but the data still seem questionable.

4. Secondary data

Secondary data in research may be defined as “data which have not been collected with the specific research question in mind”. Within veterinary epidemiology, such data might originally have been collected for: (a) management purposes (e.g. Dairy Herd Improvement records); (b) administration (e.g. client records at clinics, claims at insurance companies); (c) surveillance (e.g. sampling at slaughter); or (d) control functions (e.g. animal movements). The main advantage of secondary data is that they are available (and in increasing amounts, due to the digitalization of many records). The researcher's time and costs for collecting data are lower than for primary data, and it is therefore usually possible to obtain large samples. It might thus also be possible to sample a large part of the population – thereby limiting selection bias. Primary data can be affected by some specific biases (e.g. recall, non-response) because the research question is known to both the researcher and the subject; secondary data are less likely to be affected by these particular biases.

Table 1. Example of questions that should be considered when assessing the usefulness of secondary data.

Population
  How representative are the data of the population targeted for the research?
  What is the coverage/completeness of the data with respect to the target population?

Events
  Are codes for e.g. clinical diagnoses precise enough for the purpose of the research? And do clinicians use the codes for the targeted events? Are there several codes that must be identified?
  Are observations possible to record, e.g. when a new emerging disease appears (e.g. Schmallenberg virus)?
  Are codes for a specific event constant over time? Do the codes that were used represent the same phenomena through time?
  What are the sensitivity and specificity of the recordings?

Data management
  How reliable is the data entry/coding? Are staff properly trained?
  Is the information complete (e.g. are diseases/diagnoses/characteristics other than the index or most severe event recorded)? (Note that co-morbidity may be “forgotten” or even impossible to enter into the data-recording system.)
  Is additional (administrative) information (e.g. sex, age, breed) recorded and correct?
  Can observations be linked to other sources (i.e. is the identification of units of observation (subjects) unique and communicable)?

Secondary data are likely to vary substantially between sources – and perhaps within a source – with respect to features such as correctness. For example, breed is likely to be meticulously recorded in some or all breeding registries, but it will be less accurately recorded in veterinary-practice data or insurance data. Some secondary databases and data sources in Scandinavia have been developed in rather close collaboration with researchers, and might therefore have good data quality a priori. However, the researcher usually has no control over – or even information about – the selection process, the quality of the data, or the methods used for collecting the information; secondary data are therefore subject to other biases. Also, secondary data might be impossible to validate. Some factors possibly limiting the usefulness of secondary data are implicit in the questions outlined in Table 1.

Secondary data should of course be validated by the same methods as primary data (i.e. identification of “non-normal” observations, etc.; “internal validity”), but it is even more problematic to know what to do with such observations, because the researcher usually has little or no control over the recording process. In addition, secondary data need to be examined for their external validity, and this can be done in several ways.

4.1. Patient-chart review

Table 2. Terminology in comparing cases found in two (independent) sources of data (Sørensen et al., 1996).

                          Data source 2(a)
Data source 1             Registered cases    Non-registered cases    Total
Registered cases          a                   b                       a + b
Non-registered cases      c                   d                       c + d
Total                     a + c               b + d

(a) Data source 2 is used for evaluation of data source 1. It can be, but seldom is, a “gold standard”.

Possibly the most common way to validate secondary data (from either humans or animals) is to compare data in a database with data that can be found in (manual) patient charts. Such reviews can be performed in either “direction”: starting with (a sample of) cases identified in the database and tracing them back to the original records, or collecting (a sample of) cases from patient records and identifying to what extent, and how, they appear in the database. In the first case, only correctness can be assessed – the second can provide information about both correctness and completeness. However, cases not included in the patient records obviously cannot be retrieved, so the completeness information is still only partial. It is relevant to compare not only the primary information (e.g. a disease recording) between the two sources, but also auxiliary information (such as sex, age, etc.). The latter (in addition to providing useful information) might also be an indicator of the reliability of the primary information itself.

An example of a patient-chart review is the study by Jansson Mörk et al. (2010), in which copies of receipts left on-farm after veterinary visits were retrieved from 112 dairy farms, either as photocopies or digital photos. A simple random sample of the receipts was entered into a database and compared with data in the official disease records of the Swedish Board of Agriculture (in which all veterinary treatments of production animals in Sweden are supposed to be recorded). Discrepancies between the two sources were scrutinized to find reasons, and the completeness of the official database was investigated with a multilevel regression model. In short, the overall completeness for diagnostic events in individual animals was 84%, and the completeness was affected by the employment type of the veterinarian, region, disease complex, and the random effect of veterinarian.

4.2. Second source of data

Another approach to validating secondary data is to use a second source of data in which the whole or part of the target population is registered. Ideally, this second source should be independent of the first. Using different sources of data might obviously cause additional complications, such as: (a) codes might not be directly comparable (within or across time); (b) dates of events or auxiliary information might conflict or disagree; and (c) the completeness and correctness of the second source (i.e. how close it is to the “truth”) may be unknown. The case data in the two sources are compared case-by-case, and can be combined as in Table 2 (Sørensen et al., 1996).


The degree of completeness of data source 1 (i.e. the proportion of observed disease events that is present in the research database) can then be calculated as a/(a + c), and an estimate of the correctness of data source 1 can be calculated as a/(a + b). If, but only if, data source 2 is a proper gold standard (i.e. very close to the “truth”), the completeness can be considered similar to the sensitivity, and the correctness to the positive predictive value, of data in source 1. However, if neither source is assumed to contain data of better quality than the other, the term “agreement” might in certain instances be more correct to use, and a “degree” of agreement (e.g. kappa) can be calculated. Obviously, d in Table 2 cannot be known, because those are cases that are not recorded in either of the two data sources, and therefore the specificity of data source 1 relative to source 2 cannot be calculated directly. However, it might be possible to estimate d by applying capture–recapture methods, which are commonly used in e.g. ecology to estimate population sizes, but have also been applied in epidemiological studies (Chao et al., 2001). A simple and “nearly unbiased” estimate of d can then be calculated as b × c/(a + 1) (Hook and Regal, 1992). The approach is based, however, on preconditions such as: the sources are independent; there is no heterogeneous catchability within sources; the study population is closed; and all cases identified by each source are true cases (Chao et al., 2001; Gallay et al., 2002). Some of these preconditions might be difficult to meet in practice, and it is worth noting that if sources 1 and 2 are positively associated (a likely scenario in many circumstances in veterinary epidemiology), d will be underestimated (Hook and Regal, 1992). Using several data sources and more-advanced statistical methods might improve the estimate of d (Chao et al., 2001). If d can be estimated, the total number of cases in the base population can be calculated as a + b + c + d; if the number of individuals in the base population (nBP) is known, the register-based specificity of data source 1 can be calculated as (nBP − (c + d))/nBP (Sørensen et al., 1996). Also, a register-based sensitivity of data source 1 can be calculated as (a + b)/(a + b + c + d). These calculations are sketched in the two examples below.

A Nordic project (DAHREVA) on validation of the national official disease-recording systems is one example where this approach has been used. Data source 1 held data from multiple countries' disease-recording systems, and data source 2 held the data from a primary data collection in which a total of 580 dairy farms recorded all clinical diseases during two 2-month periods. The completeness was calculated for several groups of diseases and for each country. The completeness varied with country, disease complex, the origin of the recording (veterinary-treated or farmer-observed disease), and the definition of “disagreement” (code and date differences). As an example, the “veterinary-treated” completeness ranged between 0.56 and 0.94 for mastitis, between 0.85 and 0.96 for estrus disturbances, between 0.71 and 0.88 for milk fever, and between 0.33 and 0.88 for locomotor disorders (Espetvedt et al., 2012; Lind et al., 2012; Wolff et al., 2012; Rintakoski et al., 2013). However, no attempt to estimate d was made in this project, because the potential data sources for a capture–recapture design would have been highly positively associated.
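First sketch: completeness and correctness from the Table 2 counts. The counts are made up for illustration.

```python
# Completeness and correctness of data source 1 from the 2 x 2 layout in
# Table 2. The counts a, b and c are made-up illustrative values; d is
# unobserved.
a, b, c = 420, 35, 80

completeness = a / (a + c)  # cf. sensitivity, if source 2 is a gold standard
correctness = a / (a + b)   # cf. positive predictive value
print(f"completeness = {completeness:.2f}, correctness = {correctness:.2f}")
```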

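Second sketch: estimating d with the Hook and Regal (1992) formula and deriving the register-based sensitivity and specificity of Sørensen et al. (1996). The counts and base-population size are again made up.

```python
# Capture-recapture estimate of d (Hook and Regal, 1992) and the
# register-based sensitivity and specificity of data source 1
# (Sørensen et al., 1996). All inputs are made-up illustrative values.
a, b, c = 420, 35, 80
n_bp = 10_000  # number of individuals in the base population

d_hat = b * c / (a + 1)      # "nearly unbiased" estimate of d
n_cases = a + b + c + d_hat  # estimated total number of cases

register_sensitivity = (a + b) / n_cases
register_specificity = (n_bp - (c + d_hat)) / n_bp
print(f"d ~ {d_hat:.1f}, total cases ~ {n_cases:.0f}")
print(f"register-based sensitivity = {register_sensitivity:.2f}")
print(f"register-based specificity = {register_specificity:.3f}")
```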


4.3. Two-stage sampling

Two-stage sampling has been suggested as a way to evaluate the properties of a diagnostic test when it is difficult to apply the gold-standard test to all subjects because of costs or ethical considerations (McNamee, 2002). The same approach can also be applied for validation purposes, with the secondary data source viewed as the diagnostic test. In a two-stage sampling design, a subsample is selected and subsequently investigated carefully with respect to exposure and outcome variables. The approach is similar to a patient-chart review, but an important distinction is that both cases and controls (alternatively, both exposed and non-exposed subjects, depending on the relevant project's study design) should be included – through either purposive or random sampling, perhaps within strata defined by variables measured in the data source (see the first sketch below). The optimal sampling strategy depends on the structure of the data and on the property to be evaluated (McNamee, 2002). Diagnostic accuracy (and thus also data validity) from two-stage sampling can be estimated, for instance, by semi-latent models (Albert and Dodd, 2008) or data-imputation methods (Albert, 2007). However, it might not always be easy to apply this approach for validation of secondary data, because the true status of the subjects included is already “history” and cannot easily be retrieved.

4.4. Aggregate methods

A crude way to validate secondary data could, in some cases, be to use information about the number of “cases” that could be expected to be present. The expected number can be based on the total number found in other comparable sources, on predictions from estimated associations in epidemiological studies of populations similar to the one considered, or on simulations (Goldberg et al., 1980; Sørensen et al., 1996); a toy observed-versus-expected check is given in the second sketch below.
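First sketch: drawing the second-stage subsample within strata defined by the register. The file name, column names, stratum coding (0/1) and stratum sizes are hypothetical assumptions.

```python
# Second stage of a two-stage validation design: a stratified random
# subsample of the register, drawn for careful gold-standard verification.
# File name, column names and stratum sizes are hypothetical assumptions.
import pandas as pd

register = pd.read_csv("register.csv")  # the secondary data source
n_per_stratum = {1: 100, 0: 100}        # 100 recorded cases, 100 non-cases

subsample = (register
             .groupby("recorded_case", group_keys=False)
             .apply(lambda g: g.sample(n=n_per_stratum[g.name], random_state=1)))
print(subsample["recorded_case"].value_counts())
```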

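Second sketch: a crude aggregate check comparing the register's case count against an external expectation, with a simple Poisson interval. All numbers are made up.

```python
# Crude aggregate validation: compare observed case counts in the
# register with the number expected from an external incidence estimate.
# All numbers are made-up illustrative values.
from scipy.stats import poisson

observed = 812         # cases found in the register
incidence = 0.09       # expected cases per animal-year (external estimate)
animal_years = 10_000  # population covered by the register

expected = incidence * animal_years
low, high = poisson.interval(0.95, expected)
print(f"observed {observed} vs expected {expected:.0f} "
      f"(95% interval {low:.0f}-{high:.0f})")
if observed < low:
    print("the register may be substantially incomplete")
```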
5. Is validation important?

The “validity of data” is relative to their specific use; i.e. data are valid when they are fit for the intended purpose. The validity of primary data is therefore absolutely crucial, because otherwise the data would not be fit for the intended research; all efforts should be spent on producing valid data. However, random missingness or misclassification might be less important than systematic errors, because the former affect the precision of estimated frequencies or associations, while the latter produce biased estimates. Garbage data will lead to bias – but bias is the topic of another contribution to this special issue.

Secondary data might well be valid for their intended (primary) use, but their validity as secondary data depends on how, and for what purpose, they are used. It is obvious that secondary data must be correct – but the required degree of completeness depends on the context. If secondary data are used, for instance, to estimate the frequency of a disease, then the data have to be complete. However, if they are used to compare frequencies over time or between different populations, then the data do not necessarily have to be complete; the missingness must, however, be constant over time and not dependent on e.g. characteristics of the populations. The usefulness of secondary data needs to be evaluated carefully with respect to their intended use, from one research project to another.

6. Conclusion

Another interpretation of GIGO is “garbage in, gospel out”, meaning that we may put excessive trust in anything a computer produces. This interpretation makes it even more important to validate the data used for epidemiological studies carefully, because we base our conclusions on extensive statistical analyses of the observations. There is much that can be done to validate both primary and secondary data sources, and it is difficult to find good excuses for not doing so to a larger extent.

Conflict of interest statement

None of the authors of this paper has a financial or personal relationship with other people or organizations that could inappropriately influence or bias the content of the paper.

Acknowledgments

The inspiration provided by Ian Dohoo to the authors, and to many of their and others' PhD students, is gratefully acknowledged. Not least is how to conduct and analyze observational studies according to “best epidemiological practice” – including validation of data and statistical models.

References

Albert, P.S., 2007. Imputation approaches for estimating diagnostic accuracy for multiple tests from partially verified designs. Biometrics 63, 947–957.
Albert, P.S., Dodd, L.E., 2008. On estimating diagnostic accuracy from studies with multiple raters and partial gold standard evaluation. J. Am. Stat. Assoc. 103, 61–73.
Allison, P.D., 2012. Handling missing data by maximum likelihood. In: Proceedings of the SAS Global Forum 2012 Conference. SAS Institute Inc., Cary, NC (Paper 312-2012).
Anonymous, 2013. Garbage in, garbage out. Wikipedia, http://en.wikipedia.org/wiki/Garbage_in,_garbage_out (accessed 07.02.13).
Chao, A., Tsay, P.K., Lin, S.-H., Shau, W.-Y., Chao, D.-Y., 2001. The applications of capture–recapture models to epidemiological data. Statist. Med. 20, 3123–3157.
Dufour, S., Barkema, H.W., DesCoteaux, L., DeVries, T.J., Dohoo, I.R., Reyher, K., Roy, J.-P., Scholl, D.T., 2010. Development and validation of a bilingual questionnaire for measuring udder health related management practices on dairy farms. Prev. Vet. Med. 95, 74–85.
Espetvedt, M.N., Wolff, C., Rintakoski, S., Lind, A.-K., Østerås, O., 2012. Completeness of metabolic disease recordings in Nordic national databases for dairy cows. Prev. Vet. Med. 105, 25–37.
Gallay, A., Nardone, A., Vaillant, V., Desenclos, J.C., 2002. The capture–recapture applied to epidemiology: principles, limits and application. Rev. Epidemiol. Sante 50, 219–232.
Goldberg, J., Gelfand, H.M., Levy, P.S., 1980. Registry evaluation methods: a review and case study. Epidemiol. Rev. 2, 210–220.
Hook, E.B., Regal, R.R., 1992. The value of capture–recapture methods even for apparent exhaustive surveys. Am. J. Epidemiol. 135, 1061–1067.
Jansson Mörk, M., Wolff, C., Lindberg, A., Vågsholm, I., Egenvall, A., 2010. Validation of a national disease recording system for dairy cattle against veterinary practice records. Prev. Vet. Med. 93, 183–192.

Lam, K., Parkin, T., Riggs, C., Morgan, K., 2007. Use of free text clinical records in identifying syndromes and analysing health data. Vet. Rec. 161, 547–551.
Lind, A.-K., Thomsen, P.T., Ersbøll, A.K., Espetvedt, M.N., Wolff, C., Rintakoski, S., Houe, H., 2012. Validation of Nordic dairy cattle disease recording databases – completeness for locomotor disorders. Prev. Vet. Med. 107, 204–213.
McNamee, R., 2002. Optimal designs of two-stage studies for estimation of sensitivity, specificity and positive predictive value. Statist. Med. 21, 3609–3625.
Murray, R.C., Walters, J.M., Snart, H., Dyson, S.J., Parkin, T.D.H., 2010. Identification of risk factors for lameness in dressage horses. Vet. J. 184, 27–36.
National Academy of Sciences, 2009. Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. National Academies Press, Washington, DC, 162 pp. (ISBN-10: 0309136849).
Rintakoski, S., Wolff, C., Espetvedt, M.N., Lind, A.-K., Kyyrö, J., Taponen, J., Peltoniemi, O., Virtala, A.-M., 2013. Completeness of the national disease registers for dairy cows regarding reproductive disturbances in Denmark, Finland, Norway and Sweden (under revision).


Sørensen, H.T., Sabroe, S., Olsen, J., 1996. A framework for evaluation of secondary data sources for epidemiological research. Int. J. Epidemiol. 25, 435–442.
Spratt, M., Carpenter, J., Sterne, J.A.C., Carlin, J.B., Heron, J., Henderson, J., Tilling, K., 2010. Strategies for multiple imputation in longitudinal studies. Am. J. Epidemiol. 172, 478–487.
Sterne, J.A.C., White, I.R., Carlin, J.B., Spratt, M., Royston, P., Kenward, M.G., Wood, A.M., Carpenter, J.R., 2009. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 339, 157–160.
Wolff, C., Espetvedt, M.N., Lind, A.-K., Rintakoski, S., Egenvall, A., Lindberg, A., Emanuelson, U., 2012. Completeness of the disease recording systems for dairy cows in Denmark, Finland, Norway and Sweden with special reference to clinical mastitis. BMC Vet. Res. 8, 131.
