Featured Article

Received 30 March 2014, Accepted 7 May 2015

Published online 5 June 2015 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.6543

Individual privacy versus public good: protecting confidentiality in health research

Christine M. O’Keefe a*† and Donald B. Rubin b

Health and medical data are increasingly being generated, collected, and stored in electronic form in healthcare facilities and administrative agencies. Such data hold a wealth of information vital to effective health policy development and evaluation, as well as to enhanced clinical care through evidence-based practice and safety and quality monitoring. These initiatives are aimed at improving individuals’ health and well-being. Nevertheless, analyses of health data archives must be conducted in such a way that individuals’ privacy is not compromised. One important aspect of protecting individuals’ privacy is protecting the confidentiality of their data. It is the purpose of this paper to provide a review of a number of approaches to reducing disclosure risk when making data available for research, and to present a taxonomy for such approaches. Some of these methods are widely used, whereas others are still in development. It is important to have a range of methods available because there is also a range of data-use scenarios, and it is important to be able to choose between methods suited to differing scenarios. In practice, it is necessary to find a balance between allowing the use of health and medical data for research and protecting confidentiality. This balance is often presented as a trade-off between disclosure risk and data utility, because methods that reduce disclosure risk, in general, also reduce data utility. Copyright © 2015 John Wiley & Sons, Ltd.

Keywords:

confidentiality; privacy; biostatistics; medical research; health care research

1. Introduction

The use of health and medical data in research has many benefits, including enhanced healthcare experiences and outcomes for individuals, expanded knowledge about disease and appropriate treatments, strengthened understanding of the effectiveness and efficiency of healthcare systems, and support for public-health policy development and quality and safety improvements [1, 2]. As only one example, a recent study of linked health and medical data in Western Australia found that mothers with mental illness had an increased risk of complications in pregnancy and birth [3]. After additional research, a toolkit for identifying factors that required intervention, and for identifying associated intervention programs, was launched in Western Australia in 2008 [4].

Traditionally, health and medical research data sets have been generated from surveys of participants who have given consent for that research. One example is the Framingham Heart Study, which began in 1948 and has recruited participants from Framingham, Massachusetts, USA, in several waves [5]. Under the direction of the National Heart, Lung and Blood Institute, the Study aims to identify the common factors or characteristics that contribute to cardiovascular disease. External investigators can propose research projects using the Study’s data and biological specimens that are made available in accordance with the individual consent history of each participant and ethics oversight.

The increasing use of information technology in health care has greatly increased the amount of health data available, including routinely collected administrative health data as well as clinical data. In addition, the volume and range of molecular data being collected are increasing at an unprecedented rate, including genomic and proteomic data.

a CSIRO Digital Productivity Flagship, GPO Box 664, Canberra ACT 2602, Australia
b Department of Statistics, Harvard University, Science Center, 1 Oxford Street, Cambridge, MA 02138-2901, U.S.A.

*Correspondence to: Christine M. O’Keefe, CSIRO Digital Productivity Flagship, GPO Box 664, Canberra ACT 2602, Australia.
†E-mail: [email protected]


These data can be used alone for research studies, but there is also enormous potential for combining them with survey data to enable a wider range of questions to be addressed. In response to the growing demand from researchers for such combined datasets, data linkage units have been established in some countries. In these linkage units, diverse datasets for the whole or part of a population are linked and made available for research, without the consent of the participants. Examples of such data linkage initiatives include the Manitoba Centre for Health Policy and Evaluation, Canada [6], the Oxford Medical Record Linkage System [7], the Scottish Medical Record Linkage System [8], Western Australian Data Linkage [9] and the Welsh Secure Anonymised Information Linkage (SAIL) system [10].

Along with the greater availability of health data for research, there is also a trend of growing emphasis on data sharing in research. For example, in the Final NIH statement on sharing research data [11], the National Institutes of Health (NIH) reaffirmed their support for the concept of data sharing as being essential for expedited translation of research results into knowledge, products and procedures to improve human health. The NIH adopted policies that mandate sharing of data generated or studied with US federal funding [11, 12]. Samet [13] discussed data sharing in the context of new NIH-funded population studies, noting that: ‘their data sets will be massive and afford myriad opportunities for developing and testing hypotheses, ...’. The editors of the journal Epidemiology recognise the importance of independent replication of published analyses; see Hernan and Wilcox [14]. Noting the difficulties of direct replication in epidemiology, they suggest that data sharing would at least enable comparisons of different approaches, sensitivity analyses to help identify crucial assumptions and the discovery of data inconsistencies. The editors invite authors to consider providing analytic code used for the analysis of publicly available data, as well as code used to develop and analyse simulation data.

Historically, data collection and analysis for research have been conducted internally by the same organisation, but the sharing of data between organisations has become a major component of emerging biomedical information systems [1, 2]. The sharing of health and medical data highlights confidentiality and privacy implications, which are perhaps even more critical in studies using data without participants’ consent. Health and medical data are generally considered to be among an individual’s most private and sensitive personal information. In addition, because many diseases and conditions have a hereditary component, an individual’s health information is also part of the health information of family members and so is sensitive and private to them as well.

A number of recent studies have investigated the likelihood of disclosure of health information across a range of scenarios. Sweeney [15] studied the risks inherent in patient-level health data made publicly available by the State of Washington for a small fee ($50 at the time). The data cover almost all hospitalisations for a given year and include patient demographics, diagnoses, procedures, attending physician, hospital, a summary of charges and how the bill was paid. The data do not include patient names or addresses, although ZIP codes are included.
Sweeney showed that the data could be compared with newspaper stories to uniquely re-identify 35 of the 81 cases in 2011, with matches being verified by a news reporter contacting patients.

A recent analysis of Genome-Wide Association Studies found that 6.8% had the unforeseen potential for misuse by exposing an individual research participant’s information, including revealing disease status, the predicted future likelihood or past presence of other traits, or by linking another DNA result with a participant, for example to determine presence or absence in a research cohort, ancestry and relatedness (e.g. paternity/non-paternity) [16]. In another study, Gymrek et al. [17] were able to infer the identity of a small number of male participants in public sequencing projects by linking Y-chromosome markers with available user-provided information (e.g. surname) from recreational ancestry databases. Additional public databases were used to narrow down the list of possible individuals with surnames segregating with the Y-chromosome markers.

‘The damaging effects brought on by a breach in the security of this information are endless. Third parties–employers, bankers, neighbours–could use this information to discriminate against and potentially ostracize an individual diagnosed with an “unpopular” disease or condition’ [18]. Recent breach notification laws in the USA have alerted the public to the scope and frequency of incidents including use of personal information for identity fraud, disclosure of millions of records through lost and stolen unencrypted mobile devices and a healthcare provider snooping on the medical records of famous people, family and friends [19]. A breach of confidentiality could potentially lead to adverse consequences such as reluctance to obtain medical care, loss of health insurance or even loss of employment.


In summary, the trends of increasing emphasis on population studies and data sharing in research, together with growing community awareness and concern, underline the need for strong confidentiality protection for health data in research. It is the purpose of this paper to provide a review of a number of approaches to reducing disclosure risk when making data available for research, and to present a taxonomy for such approaches. The rest of the paper is structured as follows. Section 2 provides background and related information. Section 3 gives a structured overview of a number of approaches for reducing disclosure risk when making data available for research, with examples. Section 4 describes a taxonomy for the methods in terms of a conceptual model for analysis and two overarching paradigms. This section also describes the features of the various approaches, intended to assist with selecting the appropriate method for a specific situation. A discussion is provided in Section 5.

2. Preliminaries

2.1. Overview of some relevant privacy regulation

Privacy and related legislation in many countries around the world is voluminous and complex, particularly as it pertains to health data. In this section, it is not our intention to provide an in-depth review and commentary, but rather to give an extremely high-level indication of the current state of the most relevant legislative instruments in the USA, Europe and Australia. Health research organisations and ethical review committees are excellent sources of information on relevant local researcher obligations.

2.1.1. USA. The Health Insurance Portability and Accountability Act (HIPAA) [20] aims, among other goals, to improve portability and continuity of health insurance coverage in group and individual markets and to simplify the administration of health insurance. The HIPAA Privacy Rule describes three standards for the disclosure of health information without consent, namely Safe Harbor, Limited Dataset and Statistical Standard. Safe Harbor requires the removal of 18 specified identifying variables from a dataset in order for it to be considered de-identified, so that it can be disclosed for use in research. Limited Dataset requires the removal of only 16 specified identifying variables, and also requires a data-sharing agreement between the custodian and researcher. The Statistical Standard requires an expert to certify that ‘the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information’. The de-identification strategy is important because it enables patients’ health data to be disclosed without oversight by the regulatory authorities, the risk of identification being deemed very low. In 2012, the US Federal Government published guidance for de-identification of protected health information under HIPAA [21], including clarification of the meaning of de-identification and how it can be achieved in practice. McGraw [22] reports on a Center for Democracy and Technology stakeholders’ workshop gauging support for the HIPAA de-identification strategy and discussing policy proposals to address concerns and improve trust in the process.

2.1.2. Europe. In 1995, the European Union passed a Data Protection Directive (Directive 95/46/EC) on the protection of individuals with regard to the processing of personal data and on the free movement of such data [23], which sets strict limits on the collection and use of personal data and requires that each Member State set up an independent national body responsible for the protection of these data. The Directive establishes that sensitive data (including data concerning health) can only be processed with the explicit consent of the individual, except in specific cases such as where there is significant public interest (such as in medical or scientific research) for which other safeguards have been established as an alternative to obtaining consent. It has been left to Member States to decide how they assess the significance of the public interest, and how extensive the legislative exemptions for research might be. The European Parliament is currently developing a General Data Protection Regulation to replace the Data Protection Directive. To date, more than 3000 amendments have been proposed to the Regulation in the European Parliament’s Committee on Civil Liberties, Justice and Home Affairs (LIBE).
The draft report of the LIBE committee was released in January 2013 [24].


2.1.3. Australia. The Privacy Act 1988 (Privacy Act) regulates the collection, holding, use, correction, disclosure and transfer of personal information by covered organisations. The Privacy Act gives effect to Australia’s agreement to implement the Organisation for Economic Cooperation and Development Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, as well as to its obligations under Article 17 of the International Covenant on Civil and Political Rights.


The Privacy Amendment (Enhancing Privacy Protection) Act 2012 (Privacy Amendment Act) introduced many significant changes to the Privacy Act from March 2014, and is part of the privacy law reform process that began in 2006. The Government’s response to the reform recommendations supports two central proposals to facilitate research in the public interest: simplification of regulation and protection of community expectations of privacy. A harmonised set of rules for Government and private sector researchers will replace the two previous sets of binding guidelines on non-consensual handling of personal information, and the research provisions will be expanded to allow such handling for any research in the public interest, not just for health and medical research. In addition, two important parameters of the current regime will be maintained. The public interest in research must substantially outweigh the protection of privacy, requiring a clear choice in favour of the research, and the National Health and Medical Research Council and the Privacy Commissioner will retain primary responsibility for issuing and approving the research rules. Each Australian state and territory also regulates the management of personal information [25, 26]. In addition, New South Wales, Victoria and the Australian Capital Territory all have legislation that regulates the handling of personal health information in the private sector.

2.2. Types of data

Microdata refer to datasets in which each record is contributed by an individual in the population, so that the record typically comprises values of a number of variables for that individual. A variable is numerical if its values can be measured as numbers and is categorical if its values fall into categories. Further, a numerical variable is discrete if its set of possible values is countable and is continuous if its set of possible values is uncountable. A categorical variable is nominal, ordinal or hierarchical if its categories cannot be organised in a logical sequence, have a natural order or have a natural hierarchy, respectively. An individual may contribute more than one record in microdata, for example if the data are time-stamped hospital event data.

Tabular data result when microdata are summarised and presented as a table, with axes corresponding to explanatory variables and cells corresponding to a response variable. Table cells can contain counts, where each data record contributes 1 to its tabulation cell and 0 to all other cells; in this case, the data are called tabular count data, and the table is called a contingency table. The counts can also be scaled and presented as frequencies, proportions or percentages, giving rise to tabular frequency data, proportion data or percentage data, respectively. Table cells can also contain aggregates of one response variable, for example the total or average value of that variable for individuals contributing to that cell, in which case the data are called magnitude data. Tabular data can be a data product in themselves, or can be an input into a more sophisticated analysis. Although tabular data are generally aggregated microdata, it cannot be assumed that this aggregation is sufficient to protect privacy.
For example, if a table cell value is 1, then it is known that there is only one individual with the combination of values that describes the cell. Once a person with that combination of values is found, that individual has been identified.

2.3. Types of disclosure risk

We consider the threats to an individual resulting from a data release, where the term disclosure is used to refer to an inappropriate attribution of information to an individual. There are two basic types of disclosure, namely identity and attribute disclosure [27]. An identity disclosure occurs if an individual is identifiable from the data release. An attribute disclosure occurs when the released data make it possible to infer the characteristics of an individual more accurately than would have otherwise been possible. The main ways that an identity disclosure can occur are as follows:


• Release of identifying information.
• Spontaneous recognition – where an individual is sufficiently unusual in a data collection, or the data user knows sufficiently many attributes of an individual, so that the individual can be recognised from normally non-identifying attributes. This may occur if the attributes have extreme values, such as extreme old age, or an unusual combination of attributes. For example, it is generally accepted that households have distinctive patterns of inhabitants and other features that make them vulnerable to spontaneous recognition.


• Matching to another database – where combinations of so-called key variables in the data occur in other databases sufficiently rarely that data matching reveals identity.

Attribute disclosure is usually achieved through identity disclosure; an individual is first identified through some combination of variables, and then disclosure of the values of other variables included in the released data follows. Attribute disclosure can nevertheless occur without an identification. For example, suppose a data set contains residential, occupational and income information and supports the inference that all individuals in a given area with a given occupation have roughly similar incomes [28]. Then any individual in that residential area and with that occupation is highly likely to have that income; thus, an attribute disclosure has occurred without identification of any individual in the data. The last example is also an example of a group disclosure, where there is an inappropriate attribution of information to a large percentage of an identifiable group [29].

Data considered to be especially vulnerable to disclosure include the following [30]:

• Data including geographical detail
• Longitudinal panel data
• Data containing variables with extreme values
• Data including ‘key variables’ also available in other databases accessible to a potential attacker
• Data with population-unique variable values

In general, population data are more vulnerable to disclosure than sample data because individuals are known to be represented in population data with certainty, but usually not in a sample. Disclosure risk can further depend on the existence of microdata records that are both unique in the sample and in the population on a set of potentially identifying cross-classified key variables, because such records can be matched to external datasets with high confidence [31].
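To make the notion of sample uniqueness concrete, the following is a minimal sketch (in Python, using pandas) of flagging records that are unique in a sample on a set of cross-classified key variables. The variable names and the toy data are illustrative only, and a full risk assessment would also need population-level information, for example via the sampling fraction or the models cited in [31].

```python
import pandas as pd

def sample_uniques(microdata: pd.DataFrame, key_vars: list) -> pd.DataFrame:
    """Return the records whose combination of key-variable values occurs exactly once."""
    counts = microdata.groupby(key_vars)[key_vars[0]].transform("size")
    return microdata[counts == 1]

# Illustrative microdata: sex, 5-year age band and postcode act as key variables.
df = pd.DataFrame({
    "sex":       ["F", "F", "M", "M", "F"],
    "age_band":  ["30-34", "30-34", "85+", "30-34", "85+"],
    "postcode":  ["2601", "2601", "2601", "2602", "2602"],
    "diagnosis": ["A", "B", "C", "A", "B"],
})
print(sample_uniques(df, ["sex", "age_band", "postcode"]))  # the three sample-unique records
```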


2.4. Balancing disclosure risk with data utility

The balance between protecting confidentiality and allowing the use of health and medical data for research has been represented as a trade-off between disclosure risk and data utility [27], where disclosure risk attempts to capture the probability of a disclosure of information resulting from a data release, while data utility attempts to capture some measure of the usefulness of the released data. Confidentiality methods are technical approaches designed to reduce disclosure risk and are applied in addition to governance and information security measures. However, any confidentiality method will also reduce data utility. For this reason, data confidentialisation is often carried out by highly skilled and experienced practitioners within a data custodian agency, building on a long tradition of successful data releases. The methods used are often highly tailored to the nature of the data, as well as to the agency environment, including applicable legislation and confidentiality commitments made.

The idea of balancing risk and utility advanced by Duncan et al. [27] is that, in a specific situation, the data custodian creates a Risk-Utility (or R-U) map as a two-dimensional plot of disclosure risk versus data utility for various parameter instances of a range of confidentiality methods, and chooses the method and parameter instance with the maximum utility given a maximum tolerable risk. Cox et al. [32] have argued that such risk-utility paradigms are useful for posing questions, but much less so for actually choosing confidentiality methods for statistical disclosure limitation. One of the key gaps is the general lack of widely accepted measures of disclosure risk and data utility; consequently, a range of assessments of disclosure risk and data utility have been developed for specific situations; see for example [33, 34].

A notable example of a proposed risk standard is differential privacy, which seeks to formalise the notion of privacy in the context of algorithms performed on confidential information, including statistical analysis of confidential datasets via remote analysis [35, 36]. An algorithm is differentially private if it satisfies a formal condition, essentially requiring that when applied to any two datasets that differ in a single element, the algorithm will give similar answers. The idea behind this definition is that if this condition is satisfied, then the amount of information that can be learned about an individual data item is similarly limited, and the assumption is that this implies a low risk of identity or attribute disclosure. This idea has been challenged by Kifer and Machanavajjhala [37], who propose that differential privacy might be better interpreted as a measure of ‘evidence of participation’; for an alternative disclosure risk measure based on differential privacy, see [38].
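As an illustration of the most common route to differential privacy mentioned in Section 3.3.2 below, the following sketch adds Laplace-distributed noise to a counting query, whose sensitivity is 1 because adding or removing a single record changes the count by at most 1. The scale of the noise is the sensitivity divided by the privacy parameter epsilon; the numbers are illustrative, and this is not a complete differentially private release system.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count protected by the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one record changes it
    by at most 1), so noise with scale 1/epsilon yields an epsilon-differentially
    private answer.
    """
    if rng is None:
        rng = np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon means more noise: stronger protection, lower utility.
for eps in (0.1, 1.0, 10.0):
    print(eps, round(laplace_count(42, eps), 2))
```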


3. Methods for reducing disclosure risk

The task of balancing the research use of health and medical data with confidentiality protection can usefully be viewed as a risk management exercise involving both risk assessment and risk mitigation, while maintaining acceptable utility. A first obvious step in reducing disclosure risk is to remove identifying information such as name and address from the dataset. However, this is often insufficient to ensure that individuals represented in a data set cannot be re-identified; as mentioned in Section 2.3, such data can be vulnerable to disclosure via spontaneous recognition or data matching.

3.1. Prior related work

While some early references on confidentiality addressed the issue of protecting respondent confidentiality as researchers began to request data files for research [39, 40], others studied an individual approach, such as noise addition [41], or addressed disclosure risk more broadly [42–44]. Adam and Wortmann [45] were the first to attempt a detailed assessment of statistical disclosure control methods, and Fellegi [46] is one of the few early papers discussing the trade-off between societal benefits and costs of data sharing. In 1993, the Journal of Official Statistics responded to the observed dual trend of increasing nonresponse and increasing confidentiality concerns in government surveys with a Special Issue reporting the outcomes of a Conference on Disclosure Limitation Approaches and Data Access, held in March 1991 [47].

Confidentiality remains a major concern for national statistical agencies [28, 48], as well as for a broader range of agencies and organisations, which now find themselves holding significant data archives and receiving access requests from researchers. In recognition of the importance of biomedical data privacy, the Journal of the American Medical Informatics Association recently published a special issue on the topic [49]. Fortunately, the methods developed for national statistical agencies are also applicable for healthcare organisations and health administration agencies holding biomedical and health-related datasets. For example, in 2001, Fienberg [50] provided an introduction and overview of some statistical disclosure limitation issues and methods of special relevance to public health studies and surveys. These included perturbation methods and ‘simulated’ public-use data, especially in the context of contingency tables. In 2008, O’Keefe [51] reviewed technological approaches to the problem of balancing access and use of health data with confidentiality protection and gave examples of current large-scale implementations of the approaches: de-identification, statistical disclosure limitation and remote analysis. Reiter and Kinney [52] in 2011 presented a brief primer on techniques for sharing confidential data, in response to the Epidemiology journal’s promotion of data sharing for reproducible research. There have been a number of contributions to developing a framework for statistical disclosure limitation [45, 53, 54]. We expand and update these early taxonomies in Section 4 below.

3.2. Types of methods

In the remainder of this section, we provide a structured overview of approaches to reducing disclosure risk when making data available for research.
Importantly, each approach only addresses the disclosure risk inherent in the data, and so each must be implemented within an appropriate legislative and policy environment and governance structure, and with user community management and Information Technology (IT) security, including user authentication, access control, system audit and follow-up. The approaches have different strengths and weaknesses, and so none dominates the others in all data access scenarios. In fact, because there is a range of scenarios, it is desirable to have a range of disclosure risk reduction approaches, so that the most appropriate one can be chosen to meet the requirements of a particular situation involving a particular dataset, data custodian, analyst and so on. Traditionally, there have been two different general approaches to enabling the use of data while protecting confidentiality [53]:

• restricted or limited access, wherein the access to the information is restricted; and
• restricted or limited information, wherein the amount or format of the information released is restricted.


Often, these two approaches are used in combination, such as when access to data is restricted to approved analysts, and the data themselves have had identifying information removed and/or dates aggregated to months or years.


The relationship between the degree of access restriction and the degree of information restriction required is perhaps best represented in the framework of Marsh et al. [55], who noted that a successful disclosure involves first an attempt at disclosure, and then success of that attempt. In probabilistic terms, this is Pr(disclosure) = Pr(attempt) ⋅ Pr(disclosure ∣ attempt). Restricted access seeks to reduce Pr(attempt), while restricted information seeks to reduce Pr(disclosure ∣ attempt). Restricted access and restricted information methods are described in Section 3.3 and in Sections 3.4 and 3.5, respectively. Many of these approaches, as well as firm theoretical foundations for them, are still under development in the statistical disclosure control research and practice community. Nonetheless, some large-scale implementations have been successful, and we give examples of each of the approaches in practice, focusing, where possible, on health and medical applications.

3.3. Restricted access methods

In this section, we discuss various data access strategies used to restrict access to information. These are predominantly implemented as system designs. We present them in generally increasing order of restriction, so that the degree of requirement for information restriction generally decreases correspondingly. At opposite ends of this spectrum are the familiar data access strategies of providing no information to non-authorised users and full information to fully authorised users. In this section, we focus on methods between these two extremes.

3.3.1. User agreements for offsite use. Under this approach, sometimes called licensing, users are required to register with a custodian agency, and sign a user agreement, before receiving data to be analysed offsite. Typically, such agreements specify restrictions on the user, such as restrictions on the manner of storage and further dissemination of the data, as well as prohibiting attempts to re-identify data records. Such agreements also typically specify sanctions for breaches and are legally binding. The user community is managed by the custodian, including, possibly, the use of external audits to verify compliance with the restrictions in the agreement. Examples of this approach include the many Public Use Files disseminated by organisations and agencies, including national statistical and health agencies; see [56–58].


3.3.2. Remote analysis systems. In remote analysis, the analyst submits statistical queries through an interface, analyses are carried out on the original data in a secure environment, and the user then receives the (confidentialised) results of the analyses [59, 60]. In particular, the analyst does not receive any data at all, but only analysis results. Because analysis results can reveal information about the underlying data, the output needs to be confidentialised.

As early examples of remote analysis, table servers have been developed by the National Institute of Statistical Sciences to disseminate marginal sub-tables of a large contingency table [61, 62]. Users submit requests for marginal sub-tables of the full tables to the server, where each sub-table is specified by the variables that it contains. Potential responses to the user include the requested table, a projection of it, an otherwise modified version of the requested table or a notice of refusal, where the choice of response is made to minimise the disclosure risk. In static mode, the set of allowable responses for a given table is pre-computed. In dynamic mode, the response to each query depends on the queries that the server has previously answered, in that the disclosure risk of a given query is assessed in the light of responses given to previous queries. One problem with dynamic table servers is that in some cases, there can be no allowable response left after only a few releases.

The Australian Bureau of Statistics (ABS) Remote Access Data Laboratory (RADL) is a secure online data query service that clients can access via the ABS web site. Authorised users use the web interface to submit queries in the SAS, Stata or SPSS language to be run on confidentialised data that are kept within the ABS environment. The results of the queries are checked for confidentiality and then made available to the users via their desktop computers [63]. The RADL system enables analysis of Australian health datasets including the National Health Survey, the Disability, Ageing and Carers survey and the National Aboriginal and Torres Strait Islander Health Survey. The ABS has recently developed the TableBuilder and DataAnalyser remote analysis systems with automated confidentiality routines that allow users to build their own custom tables or undertake regression analyses on secured ABS microdata [64]. The Microdata Analysis System under development by the US Census Bureau will allow users to receive certain statistical analyses of Census Bureau data, including regression analyses, without ever having access to the data themselves [65].
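The essential pattern shared by these systems is that the analysis runs inside the secure environment and only screened results are returned; a minimal sketch follows. The minimum cell count and the refusal behaviour are illustrative assumptions, not the rules used by any of the agencies mentioned above.

```python
import pandas as pd

MIN_CELL_COUNT = 5  # illustrative threshold, not an actual agency rule

def remote_table_query(confidential: pd.DataFrame, rows: str, cols: str):
    """Tabulate two categorical variables held in the secure environment.

    The analyst never sees the microdata: only the table is returned, and only if
    every populated cell meets the minimum count; otherwise the request is refused,
    much as a static or dynamic table server might respond. (Zero cells are treated
    as acceptable here for simplicity.)
    """
    table = pd.crosstab(confidential[rows], confidential[cols])
    disclosive = (table.values > 0) & (table.values < MIN_CELL_COUNT)
    if disclosive.any():
        return "Request refused: the output would contain cells below the minimum count."
    return table
```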


We remark that remote analysis systems need to be protected against attacks including massively repeated queries, subsetting to create very small datasets and never-ending loops. Recently developed systems do not allow user-submitted code but rather implement a menu-driven interface to reduce the risks from these and other types of attack. Dwork and Smith [36] have noted that differential privacy is applicable in an interactive context in which an adversary submits a series of queries, which would include a remote analysis system. However, this approach has not been adopted in the systems mentioned earlier. To date, the most widely applicable method of achieving differential privacy is to add Laplace-distributed random noise to algorithm outputs [36]. A significant challenge with differential privacy is that some of the known differentially private algorithms provide output of low statistical usefulness; see for example [37, 66]. For a recent application of differential privacy to the problem of reducing disclosure risk for tabular data, see [67].

3.3.3. Virtual data centres. Virtual data centres are similar to remote analysis systems, except that the user has full access to the data [68], and are similar to on-site data centres, except that access is over a secure link on the internet from the researcher’s institution. An example of a virtual data centre is the US NORC Data Enclave, which provides a protected environment within which authorised social science researchers can remotely access confidential microdata [69]. In the NORC Enclave, researchers do not have access to the internet and cannot move files into or out of the secure environment without approval. Any export request from a researcher is scrutinised by an NORC statistician to ensure that it does not contain disclosive data. If there are any disclosure concerns, the researcher is notified and the output is not released. If no concerns exist, the output is cleared and uploaded to a transfer site from where the researchers can download the output.

Another interesting example is the Australian Population Health Research Network Secure Unified Research Environment (SURE) [70]; see [71]. SURE was developed by the Sax Institute for use by population health and health services researchers to analyse complex, linked administrative health and related data sets provided by a range of data custodians. Files including analysis outputs can only be exported from SURE through a curated gateway, where they are assessed for confidentiality risk and treated with confidentialisation measures if necessary. Ensuring confidentiality protection is currently the responsibility of the study’s chief investigator. All SURE users participate in mandatory training, including training in privacy and confidentiality regulatory regimes and in the principles of statistical disclosure control for protecting confidentiality. Similar systems include the United Kingdom Office for National Statistics Virtual Microdata Laboratory [72] and the UK Secure Data Service, which provides secure remote access to data operated by the Economic and Social Data Service [73].

3.3.4. Secure on-site data centres. Many national statistical agencies allow researchers access to confidential data in secure, on-site research data centres. Usually, the data have undergone a confidentialisation process such as de-identification and some light statistical disclosure control, but have more detail than datasets confidentialised for release to researchers.
Analysts are generally not restricted in the analyses they can perform and the intermediate results they can generate and view. However, only results that have been checked to ensure low disclosure risk, or which have been confidentialised if necessary to reduce disclosure risk, can be removed from the laboratory. Currently, this output checking is performed manually, following guidelines such as those in [48]. Under these guidelines, analysis outputs are classified as either safe or unsafe. Safe outputs are those which the researcher should expect to have cleared for release with minimal or no changes, while an output judged unsafe will not be cleared unless the researcher can demonstrate that the particular context and content of the output make it non-disclosive. Examples of on-site data centres include the US Census Bureau Research Data Centres [74] and the ABS On-site Data Laboratory [75].

3.4. Restricted information methods – microdata


A microdata record is generally considered to have high disclosure risk if it has an unusual combination of characteristics or if it is unique in the sample and in the population; see [76]. Geographic and time-stamped information is regarded as being at particularly high risk, because each location or time is unique. Restricted information methods normally comprise the application of some statistical disclosure limitation techniques; see [28, 48]. Statistical disclosure control techniques can be perturbative or non-perturbative.


Perturbative methods operate by modifying the data values, whereas non-perturbative methods do not modify the data values. Perhaps the most well-known perturbative method is the addition of random ‘noise’ to a dataset, and perhaps the most well-known non-perturbative method is suppression of record or variable values. In this section, we describe the main techniques developed for microdata. The methods are presented in order of generally increasing restriction on released information, and so decreasing disclosure risk, from removal of identifying information to synthetic data. The amount of trust in the analyst therefore generally decreases across the methods, and so access restrictions may also be able to be relaxed across the methods.

3.4.1. Removal of identifying information. Probably the most common method of reducing disclosure risk in data sets is to remove identifying information such as name, address, date of birth and unique identifiers such as social security number or healthcare identifier. This is often called de-identification. Unfortunately, the meaning of the term de-identification is not always clear. It appears that it is sometimes used to mean that individuals represented in the data cannot be identified from the data; however, as discussed earlier, this may be misleading. For a more detailed discussion of the issue of de-identification, particularly in the context of the Australian National Statement on Ethical Conduct in Human Research [77] and the US Health Insurance Portability and Accountability Act 1996 [20], see [51]. In most jurisdictions, privacy legislation provides for the use of de-identified health and medical data for research, without patient consent.

Some recent articles in the medical, legal and computer science literature have argued that de-identification methods are easy to reverse and therefore do not provide adequate protection; see for example [78, 79]. Rothstein [80] considers that, by itself, the strategy of de-identifying health records and biological specimens is insufficient to protect privacy in research. However, a recent systematic review of re-identification attacks on health data found that the evidence is insufficient to draw conclusions about the efficacy of de-identification methods [81].

As examples, the Population Health Research Network [82] will enable existing Australian health data to be brought together and made available for health and health-related research purposes under protocols that use linkage keys to replace personal information in health records. Similarly, the University of British Columbia Centre for Health Services and Policy Research [83] is the central access point for researchers wishing to obtain and use health data in de-identified format for research in the public interest.
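A minimal sketch of this kind of de-identification follows: direct identifiers are dropped and replaced with a project-specific linkage key computed as a keyed hash, so that the custodian can still link records across datasets. The column names, the choice of identifier to hash and the keyed-hash construction are illustrative assumptions, not the actual protocols of the initiatives named above.

```python
import hashlib
import hmac
import pandas as pd

# Illustrative direct identifiers; HIPAA Safe Harbor, for example, lists 18 of them.
DIRECT_IDENTIFIERS = ["name", "address", "date_of_birth", "medicare_number"]

def de_identify(records: pd.DataFrame, secret_key: bytes) -> pd.DataFrame:
    """Drop direct identifiers and add a keyed-hash linkage key.

    The linkage key lets the custodian link records across datasets without
    releasing the underlying identifier; as discussed above, removing direct
    identifiers alone does not eliminate re-identification risk.
    """
    linkage_key = records["medicare_number"].astype(str).map(
        lambda v: hmac.new(secret_key, v.encode(), hashlib.sha256).hexdigest()
    )
    released = records.drop(columns=DIRECT_IDENTIFIERS)
    released.insert(0, "linkage_key", linkage_key)
    return released
```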


3.4.2. Non-perturbative methods for microdata.

Suppression of variables or variable values. Entire records or entire variables, such as name of surgeon in clinical data, can be suppressed. It is also possible to suppress particular values of individual categorical variables, where such a value is sufficiently unusual that it leads to an unacceptable risk of disclosure via matching.

Variable recoding. A widely used method for reducing disclosure risk is variable recoding, or coarsening, which can be applied either as part of the data collection design phase or to the resulting dataset. The method can be applied to either tabular data or microdata, and can be applied to any number of the variables. Variable recoding usually involves reporting the values of the variable with less than full detail. For example, geographic information such as address can be recoded to suburb or postal code area, age can be recoded to 5-year or 10-year intervals, and age over a certain threshold can be recoded to simply being over that threshold. The last two are examples of collapsing levels of an ordinal categorical variable and of top-coding, respectively. Variables can also be bottom-coded; for a study of the use of cell collapsing to maintain confidentiality of medical data, see [84]. Recoding reduces disclosure risk by ensuring that records with potentially unusual or unique values (such as extreme old age or a specific date of birth) are only reported as belonging to a group with a range of such values.

Sampling. Releasing a sample of the original dataset, that is, a smaller number of cases than were collected in the original data, can reduce disclosure risk if a potential attacker does not know who is in the sample and if cases in the sample are not unique in the population on variables known to the potential attacker. As a measure of disclosure risk, the probability that a record that is unique in the sample is also unique in the population is generally approximated by the sampling fraction, see for example [85], implying that disclosure risk generally decreases as the sampling fraction decreases.
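The following sketch illustrates three of these non-perturbative methods: recoding age into 5-year bands, top-coding above a threshold and releasing only a sample. Column names, band width, threshold and sampling fraction are illustrative assumptions only.

```python
import pandas as pd

def recode_and_sample(microdata: pd.DataFrame, sampling_fraction: float = 0.1,
                      top_code: int = 85, seed: int = 0) -> pd.DataFrame:
    """Apply illustrative non-perturbative protections before release."""
    out = microdata.copy()
    # Recode exact age into 5-year bands; ages at or above the threshold are top-coded.
    out["age_band"] = pd.cut(out["age"].clip(upper=top_code),
                             bins=list(range(0, top_code + 5, 5)),
                             right=False).astype(str)
    out.loc[out["age"] >= top_code, "age_band"] = f"{top_code}+"
    out = out.drop(columns=["age"])
    # Coarsen geography: keep only the first two digits of the postcode.
    out["region"] = out["postcode"].astype(str).str[:2]
    out = out.drop(columns=["postcode"])
    # Release only a sample; disclosure risk falls roughly with the sampling fraction.
    return out.sample(frac=sampling_fraction, random_state=seed)
```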


3.4.3. Perturbative methods for microdata.

Rounding. Original variable values are replaced by values rounded to multiples of a given number such as 3 or 5. Rounding reduces disclosure risk by reducing the precision of variable values and the likelihood of unique variable values.

Data swapping. Data swapping transforms a database by interchanging the values of one or more variables between pairs of records in a microdata file. Data swapping reduces the risk of identity disclosure by spontaneous recognition. It is also intended to discourage potential attackers from matching records, because any matches found could well be incorrect.

Additive or multiplicative noise. Randomly distributed noise values can be added to the data, or the data can be multiplied by randomly distributed values. Additive noise can be uncorrelated or correlated and can be augmented with a linear or non-linear transformation. Additive and multiplicative noise are generally only suitable for continuous data. They both reduce disclosure risk by preventing exact matching; however, approximate matching may still be possible.

Micro-aggregation. Micro-aggregation is applied by clustering records into small groups of similar records and replacing individual record values by the cluster average values. Micro-aggregation is based on the rationale that if each record in a microdata file corresponds to at least k (say) individuals, then each individual’s record is indistinguishable from at least k others, so that the risk of exact identification is reduced.

Post-randomisation method (PRAM). The PRAM technique is applied to categorical data and involves a form of intended misclassification using a known and pre-set probabilistic mechanism. Under PRAM, for each record in a microdata file, the value of one or more categorical variables is changed with a certain probability. PRAM reduces the risk of identity disclosure through spontaneous recognition or data matching.

Synthetic data. In some situations, the methods mentioned here can be shown to have a significant negative impact on data utility; see for example [33, 34]. Motivated by the drawbacks associated with traditional statistical disclosure control, especially for very sensitive datasets, Rubin [86] suggested the alternative of generating and releasing (fully) synthetic data; see also [87] and [88]. In the generalisation to partially synthetic data, the data custodian releases a dataset comprising the original records with some observed values replaced with multiple imputations drawn from distributions designed to preserve important relationships in the confidential data. Disclosure risk is reduced because the original data are not released; however, if the model used to generate the imputations fits extremely well and the synthetic data are very close to the original data, then adequate protection may not be achieved. In many applications, the process of estimating the effective statistical models needed for the synthesis can be a difficult and labour-intensive task. To avoid this difficulty, it has recently been proposed that statistical model estimation could be replaced by easily implemented methods from machine learning, as has been done in missing data imputation [89, 90]. Drechsler and Reiter [91] conducted a simulation study to compare four data synthesisers based on machine learning algorithms, using a subset of the 2002 Uganda census public use file.
Their evaluation suggests that synthesisers based on regression trees can give rise to synthetic datasets that provide reliable estimates and low disclosure risks, and can be implemented easily.

3.4.4. Applicability to continuous and categorical data. The non-perturbative microdata methods are effective for categorical data, but may not provide adequate protection for continuous data [48]. Exceptions are top-coding and bottom-coding, which can be used for both. In the case of perturbative methods, PRAM, data swapping, micro-aggregation and synthetic data can be used to protect categorical data. Rounding, the use of noise, micro-aggregation and synthetic data are applicable to continuous data.
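To make two of the perturbative methods of Section 3.4.3 concrete for a continuous variable, the following sketch applies additive noise and a simple rank-based micro-aggregation. The noise scale and group size k are illustrative, and production implementations typically use more sophisticated grouping and calibration.

```python
import numpy as np
import pandas as pd

def add_noise(values: pd.Series, relative_sd: float = 0.05, seed: int = 0) -> pd.Series:
    """Additive uncorrelated noise, scaled to a fraction of the variable's standard deviation."""
    rng = np.random.default_rng(seed)
    return values + rng.normal(0.0, relative_sd * values.std(), size=len(values))

def micro_aggregate(values: pd.Series, k: int = 3) -> pd.Series:
    """Replace each value by the mean of its group of k rank-adjacent records.

    The last group may hold fewer than k records; a production implementation
    would merge it with its neighbour to preserve the minimum group size.
    """
    groups = (values.rank(method="first") - 1) // k
    return values.groupby(groups).transform("mean")
```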


3.4.5. Examples of restricted information approaches on microdata. The National Institute of Statistical Sciences package NISSWebSwap [92] is a Web Service that swaps one or more attributes between user-specified records in a microdata file, uploading the original data file from the user’s computer and downloading the file containing the swapped records.

The ABS Confidentialised Unit Record Files (CURFs) contain data from ABS surveys in the form of unit records; see [56]. CURFs have been confidentialised by removing name and address information, controlling the amount of detail and changing a small number of values through the application of statistical disclosure control techniques.


CURFs are made available to researchers and analysts through a variety of different access modes.

Internationally, the Integrated Public Use Microdata Series, International (IPUMS-International) is a project dedicated to collecting and distributing census data from around the world [58]. IPUMS-International works with each country’s statistical office to minimise the risk of disclosure of respondent information. The details of the confidentiality protections vary across countries, but in all cases, names and detailed geographic information are suppressed, and top-codes are imposed on variables such as income that might identify specific persons. In addition, IPUMS-International uses a variety of technical procedures to enhance confidentiality protection, including the following:

• Swapping an undisclosed fraction of records from one administrative district to another, to reduce the risk of positive identification of individuals.
• Randomising the placement of households within districts, to disguise the order in which individuals were enumerated or the data processed.
• Aggregating categories of a characteristic (e.g. grouping together very small ethnic categories).
• Top-coding and bottom-coding continuous variables, to prevent identification of extreme cases.

There are several examples of partially synthetic datasets already released for research. For example, the US Bureau of the Census has released a partially synthetic, public use file for the Survey of Income and Program Participation including imputed values of Social Security benefits and dozens of other highly sensitive variables [93]. The Census Bureau also seeks to reduce disclosure risk for people living in prisons, shelters and similar facilities in the American Community Survey by replacing identifying information for records at high disclosure risk with imputations [94]. More recently, a synthetic public use file for the US Longitudinal Business Database, an annual economic census of US establishments, has been approved for release by the US Bureau of the Census and the Internal Revenue Service [38]. An and Little [95] proposed the use of multiple imputation as an alternative to top-coding for preventing disclosure associated with high-age individuals in cross-sectional health data and extended it to the treatment of age in longitudinal data sets [96]. They showed that, on a dataset associated with a heart study, the approach works well in preserving the relationship between hazard and covariates. The National Center for Health Statistics has created a public-use version of the National Health Interview Survey Linked Mortality files using a combination of random perturbation (for quarter and/or year of death) as well as imputation (for cause of death) [97].

3.4.6. Confidentiality protection in epidemiology and public health surveillance. We finish this section with a brief overview of methods for confidentiality protection in epidemiology and public health surveillance datasets. In these types of data, the health information of each individual in the dataset is typically associated with their home address, and the use of spatially based methods and maps is essential. The need to protect the confidentiality of such spatial health information is well recognised, and several approaches have been proposed and evaluated; see [98, 99].
Following Curtis et al. [100], in general, there are three ways to reduce confidentiality risk in a fine-scale spatial data set:

• Perturb a point layer (such as coordinates or a geocoded address) to reduce the possibility of a displayed point being reattached to its original location [101–104]; a minimal sketch of this approach is given below.
• Aggregate the points so that individual locations cannot be identified.
• Use simulated data that mirror the real spatial distribution [105].

Each of these approaches is a special case of the general methods described earlier in this section.
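The coordinate perturbation approach in the first bullet can be sketched as follows, with each point displaced by a random distance in a random direction; the maximum displacement and the metre-to-degree conversion are illustrative assumptions rather than any published masking protocol.

```python
import numpy as np

def mask_coordinates(lat_lon: np.ndarray, max_shift_m: float = 500.0, seed: int = 0) -> np.ndarray:
    """Displace each point by a random distance (up to max_shift_m metres) in a random direction.

    lat_lon is an (n, 2) array of latitude/longitude in degrees; the metre-to-degree
    conversion below is approximate and for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = len(lat_lon)
    angles = rng.uniform(0.0, 2.0 * np.pi, n)
    dists = rng.uniform(0.0, max_shift_m, n)
    d_lat = dists * np.cos(angles) / 111_320.0
    d_lon = dists * np.sin(angles) / (111_320.0 * np.cos(np.radians(lat_lon[:, 0])))
    return lat_lon + np.column_stack([d_lat, d_lon])
```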


3.5. Restricted information methods – tabular data

Although tabular data are aggregated, there still may be unacceptable disclosure risk. Perhaps the most common disclosure risk is associated with a cell of a table that relates to only one individual, where an identity disclosure may occur by data matching from the characteristics in the table. Restricted information methods normally comprise the application of some statistical disclosure limitation techniques for tabular data; see [28, 48]. As in the case of microdata, tabular statistical disclosure control techniques can be perturbative or non-perturbative.


In this section, we describe techniques developed for tabular data. There are two main classes of confidentiality methods for tabular data, namely pre-tabular and post-tabular. Pre-tabular methods modify microdata before aggregation into a table, while post-tabular methods modify a table directly.

3.5.1. Pre-tabular methods. Perhaps the most widely used pre-tabular method is table redesign, including collapsing of categories along any axis. In addition, any of the methods presented in Section 3.4 could be used as a pre-tabular confidentiality method. The assumption is that if confidentiality has been sufficiently protected in the microdata, then it will also be sufficiently protected in the tabular data. In most applications, the impact of the microdata protection method on the table is not immediately obvious and needs investigation. An exception is data swapping, where records can be swapped in such a way as to maintain the marginal counts of the table; see [106].

3.5.2. Post-tabular methods. Post-tabular statistical disclosure control for tabular data is conducted in two steps. The first step is to identify whether the release of the data in any table cells could lead to a disclosure, and the second step is to reduce the disclosure risk associated with the identified cells. In post-tabular statistical disclosure control, the term sensitive has traditionally been used to describe table cells with high disclosure risk. In this and the following section, we respect this tradition, noting that it does not necessarily imply that the information in the table cell is sensitive in the normal sense of the word. The most commonly used cell sensitivity tests are as follows:

• Threshold rule – a cell is sensitive if fewer than n individuals contribute to its value (frequency and magnitude tables).
• (n, k)-rule – a cell is sensitive if fewer than n individuals contribute at least k% of its value (magnitude tables) [107].

For a discussion of the shortcomings of these techniques, see [108]. After the sensitive cells in a table have been identified, the second task is to take steps to address the disclosure risk. The most commonly used techniques are the following:

• Deletion of variables – removing table axes corresponding to variables that lead to sensitive cells.
• Variable recoding – adjusting the level of aggregation of variables to reduce the number of sensitive cells. This method aggregates all cells involving the recoded variable, whether or not they have been identified as sensitive.
• Cell collapsing – merging pairs of cells until no sensitive cell remains. This method only aggregates the sensitive cells, but can make analysis more difficult.
• Cell suppression – suppression of the entry in each sensitive cell, followed by suppression of entries in non-sensitive cells sufficient to prevent reconstruction of any sensitive cell value.
• Rounding – rounding all cells to a multiple of a chosen positive integer, for example 3 or 5.
• Addition of noise – altering sensitive cell values (and usually also non-sensitive cell values) by the addition of noise sampled from some distribution. Examples of this method include the post-randomisation method [109] and the key-based method in [110].

Perhaps the most widely known software package for statistical disclosure control on tabular data, τ-ARGUS, uses variable recoding and cell suppression as its main techniques for protecting sensitive cells; see [111].
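The following sketch applies the threshold rule to a frequency table and performs primary cell suppression on the cells it flags. The threshold is illustrative, and secondary suppression, which is needed to stop suppressed values being recovered from row and column totals, is a harder combinatorial problem that is not shown.

```python
import pandas as pd

THRESHOLD = 3  # illustrative: cells with fewer than 3 contributors are sensitive

def suppress_sensitive_cells(freq_table: pd.DataFrame) -> pd.DataFrame:
    """Flag cells under the threshold rule and apply primary suppression (set them to NA)."""
    protected = freq_table.astype("float")
    sensitive = (protected > 0) & (protected < THRESHOLD)
    return protected.mask(sensitive)
```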
The ABS Census TableBuilder [112] is an online tool that allows users to create custom tables of Census data from variables including age, education, housing, income, transport, religion, ethnicity, occupation, family composition and more, for all ABS geographic areas. As described in [64], a confidentialisation process is applied dynamically after the data table has been constructed. All cell values, subtotals and totals are slightly adjusted at random to prevent any identifiable data being exposed. The adjustments are carried out in such a way that consistency of cell values across different tables constructed from the same data set is maintained.
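The consistency property described above, namely that the same cell receives the same adjustment in every table built from the same data, can be obtained by deriving the perturbation deterministically from the records that contribute to the cell. The sketch below is a generic illustration of that idea only; it is not the ABS TableBuilder algorithm nor the key-based method of [110], and the hashing scheme, adjustment range and record keys are invented for the example.

```python
# Illustration of consistent cell perturbation: the adjustment is a
# deterministic function of the set of contributing record keys, so any
# table that reproduces the same cell receives the same perturbed value.
# The hashing scheme and adjustment range are assumptions for this sketch.
import hashlib

def keyed_adjustment(record_keys, max_shift=2):
    """Map the set of contributing record keys to a small integer shift
    in [-max_shift, max_shift]; the same keys always give the same shift."""
    digest = hashlib.sha256(",".join(sorted(record_keys)).encode()).hexdigest()
    return int(digest, 16) % (2 * max_shift + 1) - max_shift

def perturbed_count(record_keys):
    true_count = len(record_keys)
    return max(0, true_count + keyed_adjustment(record_keys))

# The same group of records appears as a cell in two different tables
# and receives an identical perturbed count in both.
cell_records = {"r017", "r102", "r388", "r412", "r509"}
print(perturbed_count(cell_records))  # as it would appear in table 1
print(perturbed_count(cell_records))  # as it would appear in table 2: same value
```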


3.5.3. Applicability to count and magnitude data. The key disclosure problem for count data is the presence of cells with very small counts, typically those less than three. Applicable disclosure risk reduction methods include table redesign, rounding, controlled rounding, perturbation, controlled tabular adjustment and cell suppression.
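As one concrete instance of the rounding option for count data, unbiased random rounding rounds each cell to a neighbouring multiple of a chosen base (here 3) with probabilities chosen so that the expected value of the released count equals the original count. The function below is a generic sketch of this standard technique, not any agency's production routine.

```python
# Unbiased random rounding of cell counts to a multiple of a chosen base.
# The base and the example counts are illustrative choices.
import random

def random_round(count, base=3, rng=random):
    """Round `count` to a neighbouring multiple of `base` at random,
    with probabilities chosen so that the expected value equals `count`."""
    remainder = count % base
    if remainder == 0:
        return count
    lower = count - remainder
    # Round up with probability remainder/base, down otherwise.
    return lower + base if rng.random() < remainder / base else lower

counts = [1, 2, 4, 7, 12]
print([random_round(c) for c in counts])
```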


In the case of magnitude data, the count of individuals contributing to a cell is normally assumed to be known, so that only the magnitude value associated with the cell has the potential to be identified as sensitive, as in Section 3.5.2. The goal is to make the value hard to estimate accurately, so that the value for any one individual cannot be estimated accurately by any table user, including a survey respondent contributing to the same cell. Applicable disclosure risk reduction methods include table redesign (including variable recoding), rounding, perturbation and cell suppression.

3.6. Analysis output confidentialisation methods

Currently, virtual data centres rely on manual checking for confidentiality protection, such as the processes outlined in [48]. This solution may not be feasible in the long term, given the trend of rising user demands for data access. Although it is acknowledged that developing valid, automated output checking processes is an open research question [28], there have been some advances in such methods, developed in the context of remote analysis systems but applicable to virtual data centres as well, as follows.

Suppression was the main proposed automated output confidentialisation measure in table servers [61] and in early work on linear regression servers [59]. In other early work on remote analysis systems for model fitting, Reiter [113] suggested that a remote analysis system should release only synthetic regression diagnostics–that is, simulated values of residuals and of response and explanatory variables. For model fitting involving categorical explanatory variables, in particular logistic and multinomial regressions, the release of grouped diagnostics was proposed [114]. Later contributions included the investigation of grouped plots of residuals [60].

A first generation of remote analysis systems with little use of suppression was proposed by Sparks et al. [60, 115], with disclosure risk reduced by a combination of modifying or restricting standard statistical analyses through a menu-driven interface and modifying the output of fitted models. The system made use of restricted analysis methods such as using a specified robust regression instead of traditional regression, restrictions on the variable transformations allowed and restrictions on the level of variable interactions permitted. In addition, it relied on restricted output methods including rounding of estimated model parameters; suppression of the covariance matrix, fitted values and residuals in model fitting; and suppression of parameter values if the model fit is extremely high. Output tables would be replaced with a correspondence analysis plot, and diagnostic plots would be confidentialised through aggregation and/or smoothing.

Second-generation systems are now in development at the US Census Bureau [65] and the ABS [64]. Importantly, these new systems do not rely on restricted output methods alone but make use of a combination of protective measures from the restricted access, restricted data, restricted analysis and restricted output groups of methods. We remark that, in addition to the problem of balancing disclosure risk with data utility, remote analysis systems present additional technical statistical challenges in addressing, for example, missing data, outliers, selection bias testing, assumption checking and additional disclosure risks because of multiple, interacting queries.
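To give a flavour of such automated restricted output measures, the sketch below fits an ordinary least squares regression and releases only a confidentialised summary: rounded coefficients and a coarse fit statistic, with no covariance matrix, fitted values or residuals, and full suppression if the fit is nearly exact. It is a simplified illustration of the kinds of rules described above for first-generation systems, not the code of any actual system; the rounding precision and the R-squared cut-off are assumptions made for the example.

```python
# Sketch of automated output confidentialisation for a fitted linear model:
# only rounded coefficients and a coarse fit statistic are released, and the
# whole output is suppressed if the model fit is suspiciously close to exact.
# The rounding precision and R-squared cut-off are illustrative assumptions.
import numpy as np

def confidentialised_ols(X, y, decimals=2, max_r2=0.999):
    X1 = np.column_stack([np.ones(len(y)), X])       # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)     # ordinary least squares fit
    fitted = X1 @ beta
    ss_res = float(np.sum((y - fitted) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    if r2 > max_r2:
        return {"released": False, "reason": "near-perfect fit suppressed"}
    return {
        "released": True,
        "coefficients": np.round(beta, decimals).tolist(),  # rounded estimates only
        "r_squared": round(r2, 2),
        # no covariance matrix, fitted values or residuals are returned
    }

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.8, size=200)
print(confidentialised_ols(X, y))
```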

4. Taxonomy of approaches

In this section, we provide a taxonomy of the aforementioned approaches to balancing research use of data with confidentiality protection, based on a conceptual model for data analysis. The taxonomy extends and updates previously published taxonomies [53, 54].


Figure 1. Conceptual model for data analysis.


4.1. Conceptual model for data analysis

Our taxonomy will make use of the conceptual model for data analysis in Figure 1 [116]. The model has four stages, represented by four boxes in Figure 1, as follows:

• Dataset stage: The custodian agency prepares a dataset and makes it available through the secure analysis laboratory. The researcher selects the dataset from the list of available datasets.
• Data transformations stage: The researcher applies data transformations, including data subsetting, new variable creation and variable transformations, in order to tailor the dataset to the desired analysis. The researcher may skip this stage.
• Query stage: The researcher submits an analysis to be run on the dataset, and results are generated. The sub-stages of the Query stage are the following:
  – Analysis stage: The researcher submits an analysis request, which is performed on the dataset.
  – Output stage: Output comprising the analysis results is generated and viewed on the computer screen.
• Output for Publication stage: The output is prepared for inclusion in an academic publication, presentation, report or other dissemination channel. This may include modification of output in order to protect confidentiality.

Distinguishing Output from Output for Publication allows for the fact that researchers may be authorised to view the analysis outputs, whereas readers of the academic literature may not be.

The task of balancing access to and use of health data with confidentiality protection involves introducing protective measures at one or more of the stages in the conceptual model. For example, the microdata may be confidentialised or not, and the analyst may be able to view the microdata on their screen or not. The range of available data transformations (such as variable transformations) may be restricted or not, and individual available data transformations may be restricted or not. The range of available analyses may be restricted or not, and individual available analyses may be restricted or not. The analyst may be able to see intermediate results on their screen or not, and intermediate results displayed on the analyst's screen may be confidentialised or not. Final results displayed on the analyst's screen may be confidentialised or not.

4.2. Expanded and updated taxonomy

Recall from Section 3 that traditional approaches to reducing disclosure risk fall into two categories [53]:

• restricting access, wherein the access to data is itself restricted; and
• restricting data, wherein the amount or format of the data released is subject to restrictions.

The newer remote and virtual approaches also involve restricting queries and restricting output, where restricting queries involves imposing restrictions on the range of transformations and analyses that can be performed on the data, and restricting output involves suppressing or modifying analysis results. While these could be viewed as restricting data methods, in this paper we prefer to distinguish them from the range of methods that have traditionally been considered as restricting data. The relationship between these categories and the stages of the conceptual model, that is, which categories are implemented at each stage, is given in Table I.

Approaches to restricting data, queries and output can also be viewed from another standpoint, under which they fall into two paradigms depending on the stage at which confidentialisation measures in the form of suppressions or perturbations are applied [45].
Table I. Mapping between categories and stages of conceptual model.

                          Dataset   Data Transformations   Analysis   Output   Overall
Restricting access           -               -                 -         -        X
Restricting data             X               -                 -         -        -
Restricting queries          -               X                 X         -        -
Restricting output           -               -                 -         X        -

Table II. Mapping between paradigms, categories and methods of disclosure risk reduction.

                              Restricting data   Restricting queries   Restricting output
Input perturbation
  Deidentification                    X                    -                     -
  Microdata methods                   X                    -                     -
  Pre-tabular methods                 X                    -                     -
Output perturbation
  Remote analysis methods             X                    X                     X

Table III. Relationship between nature of statistical data product and classes of disclosure risk reduction methods, by mode of access to information and by statistical data product provided to the researcher.

User agreement for offsite use
  Microdata: Non-perturbative and/or perturbative methods
  Table: N/A
  Analysis output: N/A

Remote analysis
  Microdata: N/A
  Table: Post-tabular methods for restricting output, with possibly some light application of pre-tabular methods for restricting data
  Analysis output: Analysis output confidentialisation measures combining restricting data, queries and/or outputs

Virtual data centre
  Microdata: N/A
  Table: Post-tabular methods for restricting output, with possibly some light application of pre-tabular methods for restricting data
  Analysis output: Analysis output confidentialisation measures combining restricting data and/or outputs

Onsite data centre
  Microdata: N/A
  Table: Post-tabular methods for restricting output, with possibly some light application of pre-tabular methods for restricting data
  Analysis output: Analysis output confidentialisation measures combining restricting data and/or outputs

Under the first paradigm of input perturbation, the data are first confidentialised and then analysed. The approaches that embody this paradigm are removal of identifying information (de-identification) and non-perturbative and perturbative methods for microdata (Section 3.4), as well as pre-tabular and post-tabular methods for tabular data (Section 3.5). Whether post-tabular methods are considered input perturbation or output perturbation depends on whether the table is an input to another analysis or is itself the output. Under the second paradigm of output perturbation, the data are first analysed, and only the results are confidentialised. The approaches that embody this paradigm are the restricted output methods discussed in Section 3.6. The relationship between these two paradigms, the categories of disclosure risk reduction and the main classes of methods is given in Table II. In Table III, we summarise the relationship between the nature of the statistical data product provided to a researcher and the classes of disclosure risk reduction methods. We assume all data have been de-identified.

4.3. Applicability of the approaches


It is important to have a range of methods available because there is also a range of data-use scenarios, and it is important to be able to choose between methods suited to differing scenarios. As an example of this choice, when two Australian organisations recently sought to implement data access solutions involving output perturbation, differences in the nature of the data and the user communities led one to select remote analysis and the other to select a virtual data centre [71]. As another example, business data and social data have different characteristics, and traditional disclosure control methods developed in the context of social data may not be immediately applicable to business data [117].

Table IV. Characteristics of the three main types of disclosure reduction methods.

Type of data
  Microdata release: Static and/or structured
  Data centre: Dynamic and/or unstructured, diverse, longitudinal
  Remote analysis: Dynamic and/or unstructured, diverse, longitudinal

Custodian control over microdata
  Microdata release: Low control once released to user
  Data centre: Moderate control
  Remote analysis: High control

Confidentialisation effort
  Microdata release: Dataset confidentialised once and used for many different analyses by a range of users
  Data centre: Generally manual checking of individual outputs
  Remote analysis: Generally automated systems designed to produce only confidentialised output

Updating or removing datasets
  Microdata release: Very difficult once released to user
  Data centre: Straightforward
  Remote analysis: Straightforward

User community
  Microdata release: Generally researchers and policy analysts of medium to high statistical sophistication
  Data centre: Generally researchers and policy analysts of medium to high statistical sophistication
  Remote analysis: Broad range of users with diverse requirements and varying statistical sophistication

Access to microdata
  Microdata release: Unrestricted access to confidentialised microdata on own computer
  Data centre: Unrestricted access in secure environment – usually light confidentialisation applied
  Remote analysis: None

Analytical functionality
  Microdata release: Unrestricted
  Data centre: Unrestricted
  Remote analysis: Heavily restricted

Implications for modelling
  Microdata release: Statistical modelling on perturbed data may be more complex
  Data centre: Standard analyses conducted on original microdata
  Remote analysis: Standard analyses conducted on original microdata

Treatment of sensitive variables
  Microdata release: Generally suppressed – unavailable in models
  Data centre: Can be available as explanatory variables in model fitting
  Remote analysis: Can be available as explanatory variables in model fitting

Type of output
  Microdata release: Full range
  Data centre: Full range – if assessed to have low disclosure risk
  Remote analysis: Broad descriptive statistics, exploratory data analysis, regression coefficients, residuals and diagnostics

Impact on output
  Microdata release: Measurement error can reduce accuracy and power – analyst cannot quantify or check extent
  Data centre: User can verify confidentialised output still leads to valid inferences
  Remote analysis: User cannot verify that confidentialised output still leads to valid inferences

Audit of analyses conducted
  Microdata release: Not possible
  Data centre: Possible
  Remote analysis: Possible


Table III indicates that there are two basic types of data access method–one involving input perturbation of microdata to be released to a user for offsite use governed by a user agreement, and one involving output perturbation of tables and/or analysis results released to a user. The output perturbation approaches are further broken down into systems where the user has no access to the microdata (remote analysis) and data centres where the user has full access to the microdata (either on-line or in person). Table IV outlines applicability characteristics of these three approaches, as guidance in the choice of approach.

4.4. Advantages and disadvantages of the approaches


In this section, we provide a high-level discussion of the advantages and disadvantages of the approaches, including the potential impact on the quality of inferences, and refer the reader to the literature for more detailed critiques and comparisons. We assume that all datasets have had all identifying information removed.

The general advantages and disadvantages of the restricted access methods of microdata release, data centres and remote analysis have been presented in Table IV. From the point of view of the analyst, probably the main difference is that in remote analysis, the analyst can never see the microdata. This would be unacceptable in many scenarios requiring, for example, significant data preparation, model development and provably valid inferences. However, we believe that remote analysis may be acceptable in some scenarios. Perhaps the most compelling example is that access to confidential data through a remote analysis system may be viewed as low risk and so may require only a lightweight ethics approval process. This would enable an analyst to undertake an initial exploration of the data and perhaps find out whether a full ethics application for access to the data itself would be worthwhile. As a bonus, the analyst would have preliminary results or an indication of statistical power to include in a grant application. We note that virtual data centres perhaps overcome many of the disadvantages of remote analysis while preserving some of the benefits, because the analyst has full access to the data and can verify that confidentialised output preserves the quality of inferences. While we do not believe that data access methods involving output perturbation will ever completely replace methods involving input perturbation, we do agree with the authors of [64] in their statement that:

... recent events in the development of remote analysis servers herald the dawn of a new era in automated confidentiality protection for analysis ...

We now discuss the underlying restricted information methods. In the case of microdata, a recent synthesis of potential biases that can result from disclosure treatments, as well as when and how it is possible to avoid or correct those biases, has been presented in [118]. The methods of variable recoding, swapping, adding noise to variables and creating partially synthetic data are considered in detail, and we provide a summary here. Recoding can lead to ecological fallacies (relationships estimated at a high level of aggregation may not hold at lower levels), smoothing of spatial variation and masking of variation between collapsed levels. Top-coding or bottom-coding prevents inferences beyond the thresholds, and truncating tails can affect estimation of whole-dataset quantities [119]. Analyses involving both swapped and un-swapped variables can be distorted [34, 120]. For numeric data and additive noise, the noise is normally selected from a distribution with mean zero, so that point estimates remain unbiased although their variance may increase; however, marginal distributions can be substantially altered, and the noise increases the measurement error of explanatory variables [41]. If the replacement imputations underlying synthetic data are generated with valid models, analysts can obtain valid inferences for a wide class of estimands via a simple process [113, 121]. However, if entire variables are simulated, then inferences involving those variables are determined by the relationships in the data generation models.
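The behaviour described above for additive noise, namely point estimates that remain unbiased while their sampling variance grows and regression coefficients that are attenuated by the extra measurement error, can be seen in a small simulation. The sample size, noise scale and model below are arbitrary illustrative choices and are not taken from [41] or [118].

```python
# Simulation of additive-noise masking of a numeric variable: mean-zero noise
# leaves the mean estimate unbiased but inflates its variance, and attenuates
# a regression slope estimated from the noisy explanatory variable.
import numpy as np

rng = np.random.default_rng(1)
n, reps, noise_sd = 500, 2000, 2.0

mean_true, mean_masked, slope_true, slope_masked = [], [], [], []
for _ in range(reps):
    x = rng.normal(loc=10.0, scale=3.0, size=n)            # confidential variable
    y = 2.0 * x + rng.normal(scale=1.0, size=n)             # outcome
    x_masked = x + rng.normal(scale=noise_sd, size=n)        # released, masked version
    mean_true.append(x.mean())
    mean_masked.append(x_masked.mean())
    slope_true.append(np.polyfit(x, y, 1)[0])
    slope_masked.append(np.polyfit(x_masked, y, 1)[0])

print("mean of x:          true %.3f  masked %.3f" % (np.mean(mean_true), np.mean(mean_masked)))
print("variance of mean:   true %.5f  masked %.5f" % (np.var(mean_true), np.var(mean_masked)))
print("slope on x:         true %.3f  masked %.3f (attenuated)" % (np.mean(slope_true), np.mean(slope_masked)))
```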
The article concludes that all disclosure treatments degrade statistical analyses, and that it is difficult if not impossible for analysts to determine the reduction in quality of a given inference. This difficulty is partly because of the reluctance of data custodians to reveal their confidentiality methods or parameter settings. On the brighter side, many public-use microdata sets are samples with probably modest amounts of confidentialisation directed at preserving the most important statistical features, and so most inferences are probably not strongly affected [118]. As an exception to this statement, geographic and temporal variables are normally considered to introduce high disclosure risk, and so are usually strongly recoded.

For tabular data, in the case that microdata methods are applied pre-tabulation, it could be expected that many of the advantages and disadvantages of microdata methods transfer; however, in most applications, the impact of the microdata protection method on the table is not immediately obvious. With regard to post-tabular methods, most of the methods for identifying the so-called sensitive cells fail to identify all such cells and incorrectly identify some non-sensitive cells as sensitive [108, 122]. While some of the impacts of restricted information methods such as recoding and noise are similar to those for microdata, in the case of tabular data there are additional impacts. For example, suppression and cell collapsing can lead to irregular table structures that are difficult to interpret and analyse [123].

5. Discussion

It has been the purpose of this paper to provide a structured overview of a number of approaches to reducing disclosure risk when making data available for research. Some of these methods are widely used and some are still in development. It is important to have a range of methods because there is also a range of data use scenarios, and it is important to be able to choose the method best suited to a given scenario. A discussion of the applicability, advantages and disadvantages of the various approaches has been presented to assist with selecting the appropriate method for a given situation. This paper has also described a taxonomy for the methods in terms of a conceptual model for analysis and the two overarching paradigms of input perturbation and output perturbation.

We have provided, in Table IV, a summary of the applicability and features of these three approaches, as guidance in the choice of approach most suitable in a given scenario. In particular, disadvantages of one approach in a given scenario can be advantages in a different scenario. As an example, we consider two Australian organisations (ABS and PHRN) that chose different approaches to implementing a data access solution involving output perturbation. We can understand the different choices in the framework of Marsh et al. [55], namely Pr(disclosure) = Pr(attempt) ⋅ Pr(disclosure ∣ attempt). The ABS environment of essentially public use required it to assume that Pr(attempt) is close to 1, and therefore it needed to seek to minimise Pr(disclosure ∣ attempt). The PHRN environment of a known user community enabled it to assume that Pr(attempt) is negligible, and therefore it did not need to minimise Pr(disclosure ∣ attempt). Thus, differences in the nature of the data and the user communities led the ABS to select remote analysis and the PHRN to select a virtual data centre [71].

Table IV is an attempt to provide an overview of the elements of any scenario that need to be considered in selecting an approach to balancing access to and use of data with confidentiality protection. Once an approach is selected, the custodian can take advantage of the benefits of the approach and seek to minimise the impact of the disadvantages. In any given scenario, however, one or more of these elements could be extremely important and override all others. In particular, the requirement for data analysts to be able to interact directly with data is often such a ‘deal-breaker’, so in practice, the choice between an input perturbation and an output perturbation approach is probably the most fundamental one for any custodian seeking to make data available for research.

5.1. Statistical challenges


The problem of finding a balance between allowing the use of health and medical data in research, and ensuring that confidentiality is adequately protected, raises a number of interesting statistical challenges. The most fundamental of these, already mentioned in Section 2.4, is the challenge of evolving the risk-utility paradigm to the point where it can be used for evaluating and choosing between confidentiality methods for statistical disclosure limitation (SDL). One of the discussion articles following Cox, Karr and Kinney [32] reframes the original article as a contribution to the search for a unifying theoretical foundation and framework for SDL research and practice [124]. The discussion article notes that one of the key gaps in the development of the risk-utility paradigm as a theoretical foundation for SDL is the general lack of widely accepted measures of disclosure risk and data utility. In searching for such measures, and in order to apply the risk-utility paradigm, it will be vital to fully understand and make use of important differences between legitimate users and potential attackers. Although many current disclosure risk measures rely on estimates of record-level probabilities of correct re-identification of a data subject from masked data, newer methods making use of, for example, entropy [27, 125], information loss [51] and differential privacy [38] indicate that there are other ways to think about disclosure risk and data utility measures.

Once widely accepted measures of disclosure risk and data utility have been developed, a new suite of statistical challenges arises. The next problem will be to evaluate the performance of existing SDL methods and to evolve new ones with acceptable risk-utility characteristics across a broad or narrow range of scenarios involving data, custodian constraints, user, analysis, potential attacker and so on. Finally, it would be highly desirable to have a method to select the best SDL method for any given scenario.

The confidentialisation (SDL) methods described in this paper were generally developed in the context of social data of reasonable volume, about households or individuals, held by National Statistical Agencies. It is necessary to evaluate their applicability to different types of data such as business data [117] and genome-wide association studies [126]. Addressing the statistical challenges noted earlier will be even more complex in the face of emerging technologies and use cases, including the increasing use of synthetic, linked and longitudinal data in research and the augmentation of survey data with administrative data. The impending challenge of so-called big data may require a whole new class of approaches.

The relatively new remote analysis approach raises a range of additional statistical challenges. First, it is unclear how a remote analysis system with output confidentialisation processes is able to adequately support the design phase of an analysis involving many transformations, sometimes hundreds, with diagnostics. In particular, how can analysts be sure that they are getting valid results after an output confidentialisation process has been applied? We remark that verification servers [127] have been proposed as a potential solution to this issue; however, to our knowledge, none has yet been successfully implemented. Remote analysis systems also present additional technical statistical challenges in addressing, for example, missing data, outliers, selection bias testing, assumption checking and additional disclosure risks due to multiple, interacting queries.

5.2. General challenges

In addition to the statistical challenges outlined in the previous section, the preparation of this review has highlighted several more general issues and questions that we outline here.

First, some articles in the confidentiality literature suggest that only disclosures that cause actual harm should be of concern. This would be different from the justice system, in which a person can be convicted of a crime even though no actual harm results.

It is clear that data linkage initiatives are becoming more common internationally. However, what is not clear is how the confidentiality standards for the resulting linked dataset, formed from diverse component datasets with various confidentiality obligations, will be managed.

The confidentialisation processes described in this paper distort either the data or the results to a greater or lesser degree. What should happen if the confidentialisation process used by a given organisation causes invalid inferences? For example, the p-values computed from a table confidentialised with a tabular perturbation method will probably be wrong, and the removal of outliers will contract confidence intervals. It is unclear whether the organisation that performed the confidentialisation should be liable for adverse consequences.

More generally, what are (or should be) the legal obligations of an organisation implementing a remote analysis system? In this case, the host organisation is conducting a requested analysis on behalf of an analyst, a paradigm that is significantly different to providing data for the analyst to analyse themselves. A question that arises is, what should happen if the analyst requests a meaningless analysis, or one that has a flawed design? It is unclear to what degree the organisation should be responsible for ensuring that the dataset can support the requested analysis and that the analyst makes valid inferences from the data.

Acknowledgements We thank the Associate Editor and the anonymous reviewers for their careful consideration of the paper. Their many thoughtful and helpful suggestions have led to a better paper.

References


1. Safran C, Bloomrosen M, Hammond W, Labkoff S, Markel-Fox S, Tang P, Detmer D. Toward a national framework for the secondary use of health data: an American medical informatics association white paper. Journal of the American Medical Informatics Association 2007; 14:1–9. 2. Weiner M, Embi P. Toward reuse of clinical data for research and quality improvement: the end of the beginning? Annals of Internal Medicine 2009; 151:359–360. 3. Jablensky A, Morgan V, Zubrick S, Bower C, Yellachich L. Pregnancy, delivery, and neonatal complications in a population cohort of women with schizophrenia and major affective disorders. The American Journal of Psychiatry 2005; 162:79–91. 4. Hauck Y, Rock D, Jackiewicz T, Jablensky A. Healthy babies for mothers with serious mental illness: a case management framework for clinicians. International Journal of Mental Health Nursing 2008; 17:383–391. 5. Framingham Heart Study, National Heart, Lung and Blood Institute and Boston University. http://www. framinghamheartstudy.org [Accessed on 23 October 2014].


6. Roos LL, Wajda A. Record linkage strategies: part 1: estimating information and evaluating approaches. Technical Report 28, University of Manitoba, Winnipeg, 1990. 7. Gill L. OX-LINK: The Oxford Medical Rrecord Linkage System. Record linkage techniques. Technical Report 19, University of Oxford, Oxford, 1997. 8. Kendrick S, Clarke JA. The Scottish medical record linkage system. Health Bulletin (Edinburgh) 1979; 51:72–79. 9. Holman CDJ, Bass AJ, Rouse IL, Hobbs MS. Population-based linkage of health records in Western Australia: development of a health services research linked database. Australian and New Zealand Journal of Public Health 1999; 23: 453–459. 10. Ford DV, Jones KH, Verplancke JP, Lyons RA, John G, Brown G, Brooks CJ, Thompson S, Bodger O, Couch T, Leake K. The SAIL Databank: building a national architecture for e-health research and evaluation. BioMed Central Health Services Research 2009; 9:157. 11. National Insititutes of Health. Final NIH Statement on Sharing Research Data. Notice: NOT-OD-03-032 February 26 2003. http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html [Accessed on 23 October 2014]. 12. National Institutes of Health. NIH genomic data sharing policy. Technical Report, NIH 27, August 2014. http://grants. nih.gov/grants/guide/notice-files/NOT-OD-14-124.html [Accessed on 23 October 2014]. 13. Samet J. Data: to share or not to share. Epidemiology 2009; 20:172–174. 14. Hernán M, Wilcox A. Epidemiology, data sharing, and the challenge of scientific replication. Epidemiology 2009; 20: 167–168. 15. Sweeney L. Matching known patients to health records in Washington State data, Harvard University. Data Privacy Lab, June 2013. 16. Johnson AD, Leslie R, O’Donnell CJ. Temporal trends in results availability from genome-wide association studies. Public Library of Science Genetics 2011; 7(9):e1002269. DOI: 10.1371/journal.pgen.1002269. 17. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science 2013; 339(6117):321–324. 18. Carl K. It’s personal: Privacy concerns associated with personal health records. ISJLP 2010; 5:533–603. 19. El Emam K. A Guide to the De-identification of Health Information. CRC Press: New York, NY, 2013. 20. US Government. Health Insurance Portability and Accountability Act (HIPAA), 1996. http://www.gpo.gov/fdsys/pkg/ PLAW-104publ191/content-detail.html [Accessed on 1 June 2015]. 21. Office for Civil Rights. Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. U.S. Department of Health and Human Services September 4 2012: Washington, DC. 22. McGraw D. Building public trust in uses of Health Insurance Portability and Accountability Act de-identified data. Journal of the American Medical Informatics Association 2013; 20(1):29–34. 23. European Union. Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data 1995. http://eur-lex. europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:en:HTML [Accessed on 23 October 2014]. 24. European Parliament. Draft Report. http://www.europarl.europa.eu/meetdocs/2009_2014/documents/libe/pr/922/ 922387/922387en.pdf [Accessed on 23 October 2014]. 25. Holman C. Anonymity & Research: Health Data and Biospecimen Law in Australia. Uniprint, The University of WA: Perth, 2012. 26. 
O’Keefe CM, Connolly C. Privacy and the use of health data for research. Medical Journal of Australia 2010; 193: 537–541. 27. Duncan GT, Keller-McNulty SA, Stokes SL. Disclosure risk vs data utility: the R-U confidentiality map. Technical Report LA-UR-01-6428, Los Alamos National Laboratory, 2001. 28. Duncan G, Elliot M, Salazar-Gonzàlez JJ. Statistical Confidentiality. Springer: New York, 2011. 29. Cox L. Overview of statistical disclosure limitation. DIMACS Working Group on Challenges for Cryptographers in Health Data Privacy 10–12 December 2003, Rutgers University. 30. Fienberg S. Confidentiality and disclosure limitation. Encyclopedia of Social Measurement 2005; 1:463–469. 31. Skinner C, Shlomo N. Assessing identification risk in survey microdata using log-linear models. Journal of the American Statistical Association 2008; 103:989–1001. 32. Cox L, Karr A, Kinney S. Risk-utility paradigms for statistical disclosure limitation: how to think, but not how to act. International Statistical Review 2011; 79:160–183. 33. Elliot M, Purdam K. A case study of the impact of statistical disclosure control on data quality in the individual UK samples of anonymized records. Environment and Planning A 2007; 39:1101–1118. 34. Winkler W. Examples of easy-to-implement, widely used methods of masking for which analytic properties are not justified. Technical Report, U.S. Bureau of the Census, 2007. 35. Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. 3rd IACR Theory of Cryptography Conference, New York, USA, 2006, 265–284. 36. Dwork C, Smith A. Differential privacy for statistics: what we know and what we want to learn. Journal of Privacy and Confidentiality 2009; 1:135–154. 37. Kifer D, Machanavajjhala A. No free lunch in data privacy. Proceedings of the SIGMOD ’11, Athens, Greece, 2011, 193–204. 38. Kinney SK, Reiter JP, Reznek AP, Miranda J, Jarmin RS, Abowd JM. Towards unrestricted public use business microdata: the synthetic longitudinal business database. International Statistical Review 2011; 79(3):362–384. 39. Steinberg J, Pritzker L. Some experiences with and reflections on data linkage in the United States. International Statistical Institute, 1967. 40. Bachi R, Baron R. Confidentiality problems related to data banks. Bulletin of the International Statistical Institute 1969; 43:225–241.


41. Fuller WA. Masking procedures for microdata disclosure limitation. Journal of Official Statistics 1993; 9:383–383. 42. Dalenius T. Towards a methodology for statistical disclosure control. Statistik Tidskrift 1977; 15:429–444. 43. Duncan GT, Lambert D. Disclosure-limited data dissemination. Journal of the American Statistical Association 1986; 81(393):10–18. 44. Duncan G, Lambert D. The risk of disclosure for microdata. Journal of Business and Economic Statistics 1989; 7:207–217. 45. Adam N, Wortmann J. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys 1989; 21:515–556. 46. Fellegi IP. On the question of statistical confidentiality. Journal of the American Statistical Association 1972; 67(337): 7–18. 47. Duncan G. Preface, special issue. Journal of Official Statistics 1993; 9:269–270. 48. Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Nordholt E, Spicer K, de Wolf PP. Statistical Disclosure Control, Wiley Series in Survey Methodology. John Wiley & Sons: United Kingdom, 2012. 49. Malin BA, El Emam K, O’Keefe CM. Biomedical data privacy: problems, perspectives, and recent advances. Journal of the American Medical Informatics Association 2013; 20:2–6. 50. Fienberg S. Statistical perspectives in confidentiality and data access in public health. Statistics in Medicine 2001; 20:1347–1356. 51. O’Keefe CM. Privacy and the use of health data - reducing disclosure risk. Electronic Journal of Health Informatics 2008; 3(1):E5 (9pp). 52. Reiter J, Kinney S. Commentary: Sharing confidential data for research purposes: a primer. Epidemiology 2011; 22: 632–635. 53. Duncan G, Pearson R. Enhancing access to microdata while protecting confidentiality: prospects for the future. Statistical Science 1991; 6:219–239. 54. O’Keefe CM, Good N. Regression output from a remote analysis system. Data and Knowledge Engineering 2009; 68:1175–1186. 55. Marsh C, Skinner C, Arber S, Penhale B, Openshaw S, Hobcraft J, Lievesley D, Walford N. The case for samples of anonymized records from the 1991 census. Journal of the Royal Statistical Society Series A 1991; 154:305–340. 56. Australian Bureau of Statistics. About CURF microdata. http://www.abs.gov.au/websitedbs/D3310114.nsf/home/ About+CURF+Microdata [Accessed on 23 October 2014]. 57. Centers for Disease Control and Prevention. Public-use data files and documentation. http://www.cdc.gov/nchs/data_ access/ftp_data.htm [Accessed on 23 October 2014]. 58. Minnesota Population Center University of Minnesota. IPUMS International. https://international.ipums.org/ international/ [Accessed on 23 October 2014]. 59. Gomatam S, Karr A, Reiter J, Sanil A. Data dissemination and disclosure limitation in a world without microdata: a risk-utility framework for remote access systems. Statistical Science 2005; 20:163–177. 60. Sparks R, Carter C, Donnelly J, O’Keefe CM, Duncan J, Keighley T, McAullay D. Remote access methods for exploratory data analysis and statistical modelling: privacy-preserving analytics™. Computer Methods and Programs in Biomedicine 2008; 91:208–222. 61. Karr A, Lee J, Sanil A, Hernandez J, Karimi S, Litwin K. Web-based systems that disseminate information but protect confidentiality. In Advances in Digital Government: Technology, Human Factors and Public Policy, McIver W, Elmagarmid A (eds). Kluwer: Amsterdam, 2002; 181–196. 62. Karr AF, Dobra A, Sanil AP. Table servers protect confidentiality in tabular data releases. Communications of the ACM 2003; 46(1):57–58. 63. Australian Bureau of Statistics. 
Remote Access Data Laboratory (RADL). http://www.abs.gov.au [Accessed on 23 October 2014]. 64. Thompson G, Broadfoot S, Elazar D. Methodology for automatic confidentialisation of statistical outputs from remote servers at the Australian Bureau of Statistics. Joint UNECE/Eurostat work session on statistical data confidentiality (Ottawa, Canada, 28–30 October 2013). 37pp. 65. Lucero J, Zayatz L, Singh L, You J, DePersio M, Freiman M. The current stage of the microdata analysis system at the U.S. Census Bureau. Proceedings of the 58th Congress of the International Statistical Institute, ISI 2011, Dublin, Ireland, 2011. 66. Sarathy R, Muralidhar K. Evaluating Laplace noise addition to satisfy differential privacy for numeric data. Transactions in Data Privacy 2011; 4:1–17. 67. Yang X, Fienberg SE, Rinaldo A. Differential privacy for protecting multi-dimensional contingency table data: extensions and applications. Journal of Privacy and Confidentiality 2012; 4(1):101–125. 68. O’Keefe CM, Westcott M, Ickowicz A, O’Sullivan M, Churches T. Protecting confidentiality in statistical analysis outputs from a virtual data centre. Working Paper 29-30 October 2013. Joint UNECE/Eurostat work session on statistical data confidentiality. Ottawa, Canada, 10pp. http://www.unece.org/stats/documents/2013.10.confidentiality.html [Accessed on 23 October 2014]. 69. University of Chicago, NORC website. http://www.norc.org [Accessed on 23 October 2014]. 70. Sax Institute, Secure Unified Research Environment (SURE). http://www.sure.org.au [Accessed on 23 October 2014]. 71. O’Keefe CM, Gould P, Churches T. Comparison of two remote access systems recently developed and implemented in Australia. In Privacy in Statistical Databases PSD2014, LNCS, Domingo-Ferrer J (ed.), Vol. 8744. Springer: Heidelberg, 2014; 299–311. 72. Office for National Statistics. http://www.statistics.gov.uk [Accessed on 23 October 2014]. 73. UK Data Archive. Secure data service website. http://ukdataservice.ac.uk/get-data/how-to-access/accesssecurelab.aspx [Accessed on 23 October 2014]. 74. United States Census Bureau. http://www.census.gov [Accessed on 23 October 2014]. 75. Australian Bureau of Statistics. http://www.abs.gov.au [Accessed on 23 October 2014].


76. Skinner C, Elliot M. A measure of disclosure risk for microdata. Journal of the Royal Statistical Society Series B 2002; 64:855–867. 77. Australian Government National Health and Medical Research Council. National statement on ethical conduct in human research 2007 - updated 2014. Online publication Reference Number E72, 2007. http://www.nhmrc.gov.au/guidelines/ publications/e72 [Accessed on 23 October 2014]. 78. McGuire A, Genetics Gibbs R. No longer de-identified. Science 2006; 312:370–371. 79. Ohm P. Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Review 2010; 57:1701–1777. 80. Rothstein M. Is deidentification sufficient to protect health privacy in research? American Journal of Bioethics 2010; 10:3–11. 81. El Emam K. Methods for de-identification of electronic health records for genomic research. Genome Medicine 2011; 3:25. DOI: 10.1186/gm239. 82. Population Health Research Network. http://www.phrn.org.au/ [Accessed on 23 October 2014]. 83. British Columbia Linked Health Database (BCHLD). http://riskfactor.cancer.gov/tools/pharmaco/epi/british_columbia. html [Accessed on 23 October 2014]. 84. Lin Z, Hewett M, Altman R. Using binning to maintain confidentiality of medical data. Annual Symposium Proceedings, American Medical Informatics Association, San Antonio, USA, 2002, 454–458. 85. Fienberg S, Makov U. Confidentiality, uniqueness and disclosure limitation for categorical data. Journal of Official Statistics 1998; 14:385–397. 86. Rubin D. Discussion: statistical disclosure limitation. Journal of Official Statistics 1993; 9:462–468. 87. Little R. Statistical analysis of masked data. Journal of Official Statistics 1993; 9:407–426. 88. Reiter J. Using CART to generate partially synthetic public use microdata. Journal of Official Statistics 2005; 21:441–462. 89. Iacus S, Porro G. Missing data imputation, matching and other applications of random recursive partitioning. Computational Statistics and Data Analysis 2007; 52:773–789. 90. Burgette L, Reiter J. Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology 2010; 172:1070–1076. 91. Drechsler J, Reiter J. An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Computational Statistics and Data Analysis 2011; 55:3232–3243. 92. National Institute of Statistical Sciences. NISSWebSwap, Version 1.1, 2003. http://nisla05.niss.org/WebServices/dg/ WebSwap.html [Accessed on 23 October 2014]. 93. Abowd JM, Stinson M, Benedetto G. Final report to the Social Security Administration on the SIPP/SSA/IRS public use file project. Technical Report, 2006. 94. Hawala S. Producing partially synthetic data to avoid disclosure. Proceedings of the Joint Statistical Meetings, American Statistical Association, Alexandria, VA, 2008. 95. An D, Little R. Multiple imputation: an alternative to top coding for statistical disclosure control. Journal of the Royal Statistical Society Series A 2007; 170:923–940. 96. An D, Little R, McNally J. A multiple imputation approach to disclosure limitation for high-age individuals in longitudinal studies. Statistics in Medicine 2009; 29:1769–1778. 97. Lochner K, Hummer R, Bartee S, Wheatcroft G, Cox C. The public-use national health interview survey linked mortality files: methods of reidentification risk avoidance and comparative analysis. American Journal of Epidemiology 2008; 168:336–344. 98. Boulos MN, Curtis AJ, AbdelMalik P. 
Musings on privacy issues in health research involving disaggregate geographic data about individuals. International Journal of Health Geographics 2009; 8(1):46. 99. VanWey L, Rindfuss R, Gutmann M, Entwisle B, Balk D. Confidentiality and spatially explicit data: Concerns and challenges. Proceedings of the National Academy of Sciences of the United States of America 2005; 102:15337–15342. 100. Curtis A, Mills J, Agustin L, Cockburn M. Confidentiality risks in fine scale aggregations of health data. Computers, Environment and Urban Systems 2011; 35:57–64. 101. Cassa C, Grannis S, Overhage J, Mandl K. A context-sensitive approach to anonymizing spatial surveillance data: impact on outbreak detection. Journal of the American Medical Informatics Association 2006; 13:160–165. 102. Leitner M, Curtis A. Cartographic guidelines for geographically masking the locations of confidential point data. Cartographic Perspectives 2004; 49:22–39. 103. Olson K, Grannis S, Mandl K. Privacy protection versus cluster detection in spatial epidemiology. American Journal of Public Health 2006; 96:2002–2008. 104. Zhou Y, Dominici F, Louis TA. A smoothing approach for masking spatial data. The Annals of Applied Statistics 2010; 4(3):1451–1475. 105. Wang H, Reiter JP. Multiple imputation for sharing precise geographies in public use data. The Annals of Applied Statistics 2012; 6(1):229–252. 106. Dalenius T, Reiss SP. Data-swapping: a technique for disclosure control. Journal of Statistical Planning and Inference 1982; 6:73–85. 107. Cox L. Linear sensitivity measures in statistical disclosure control. Journal of Statistical Planning and Inference 1981; 5:153–164. 108. Robertston D, Ethier R. Cell suppression: experience and theory. In Inference Control in Statistical Databases, Domingo-Ferrer J (ed.), Lecture Notes in Computer Science, vol. 2316. Springer-Verlag: Berlin Heidelberg, 2002; 213–239. 109. Gouweleeuw J, Kooiman P, DeWolf LWPP. Post randomisation for statistical disclosure control: theory and implementation. Journal of Official Statistics 1998; 14:463–478. 110. Marley J, Leaver V. A method for confidentialising user-defined tables: statistical properties and a risk-utility analysis. Proceedings of the 58th Congress of the International Statistical Institute, ISI 2011, Dublin, Ireland, 21–26 August 2011.


111. ESSnet-project. τ-ARGUS website. http://neon.vb.cbs.nl/casc/tau.htm [Accessed on 23 October 2014].
112. Australian Bureau of Statistics. Census TableBuilder, n.d. http://www.abs.gov.au [Accessed on 23 October 2014].
113. Reiter J. Model diagnostics for remote-access regression systems. Statistics and Computing 2003; 13:371–380.
114. Reiter J, Kohnen C. Categorical data regression diagnostics for remote systems. Journal of Statistical Computation and Simulation 2005; 75:889–903.
115. Sparks R, Carter C, Donnelly J, Duncan J, O’Keefe CM, Ryan L. A framework for performing statistical analyses of unit record health data without violating either privacy or confidentiality of individuals. Proceedings of the 55th Session of the International Statistical Institute, Sydney, 2005; 4:(4pp).
116. O’Keefe CM, Chipperfield JO. A summary of attack methods and protective measures for fully automated remote analysis systems. International Statistical Review 2013; 81:426–455.
117. O’Keefe CM, Shlomo N. Applicability of confidentiality methods to personal and business data. In Privacy in Statistical Databases PSD2014, LNCS, Domingo-Ferrer J (ed.), Vol. 8744. Springer: Heidelberg, 2014; 350–363.
118. Reiter JP. Statistical approaches to protecting confidentiality for microdata and their effects on the quality of statistical inferences. Public Opinion Quarterly 2012; 76(1):163–181.
119. Kennickell A, Lane J. Measuring the impact of data protection techniques on data utility: evidence from the survey of consumer finances. In Privacy in Statistical Databases. Springer: Heidelberg, 2006; 291–303.
120. Drechsler J. Multiple imputation in practice–a case study using a complex German establishment survey. Advances in Statistical Analysis 2011; 95:1–26.
121. Reiter JP, Raghunathan TE. The multiple adaptations of multiple imputation. Journal of the American Statistical Association 2007; 102(480):1462–1471.
122. Domingo-Ferrer J, Torra V. A critique of the sensitivity rules usually employed for statistical table protection. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 2002; 10:545–556.
123. Cox L. Confidentiality issues for statistical database query systems. Invited Paper for Joint UNECE/Eurostat Seminar on Integrated Statistical Information Systems and Related Matters (ISIS2002), Geneva, Switzerland, April 2002, 17–19. http://www.unece.org/stats/documents/ces/sem.47/15.s.e.pdf [Accessed on 23 October 2014].
124. O’Keefe CM. Discussion. International Statistical Review 2011; 79(2):191–194.
125. O’Keefe CM, Good N. Risk and utility of alternative regression diagnostics in remote analysis servers. Proceedings of the 55th Session of the ISI International Statistical Institute, Lisbon, Portugal, 2007:(8pp).
126. Yu F, Fienberg SE, Slavković AB, Uhler C. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics 2014; 50:133–141.
127. Reiter J, Oganian A, Karr A. Verification servers: enabling analysts to assess the quality of inferences from public use data. Computational Statistics and Data Analysis 2009; 53:1475–1482.
