MAIN PAPER (wileyonlinelibrary.com) DOI: 10.1002/pst.1615

Published online 25 March 2014 in Wiley Online Library

Preparing individual patient data from clinical trials for sharing: the GlaxoSmithKline approach

Sara Hughes (a,*), Karen Wells (b), Paul McSorley (c) and Andrew Freeman (d)

In May 2013, GlaxoSmithKline (980 Great West Road, Brentford, Middlesex, TW8 9GS, UK) established a new online system to enable scientific researchers to request access to anonymised patient level clinical trial data. Providing access to individual patient data collected in clinical trials enables conduct of further research that may help advance medical science or improve patient care. In turn, this helps ensure that the data provided by research participants are used to maximum effect in the creation of new knowledge and understanding. However, when providing access to individual patient data, maintaining the privacy and confidentiality of research participants is critical. This article describes the approach we have taken to prepare data for sharing with other researchers in a way that minimises risk with respect to the privacy and confidentiality of research participants, ensures compliance with current data privacy legal requirements and yet retains utility of the anonymised datasets for research purposes. We recognise that there are different possible approaches and that broad consensus is needed. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: data-anonymisation; data-sharing; privacy; transparency

1. INTRODUCTION

Pharmaceut. Statist. 2014, 13 179–183

a Clinical Statistics, GSK Research and Development, Stockley Park West, Uxbridge, Middlesex UB11 1BT, UK

b Statistical Governance, GSK Research and Development, Stockley Park West, Uxbridge, Middlesex UB11 1BT, UK

c Clinical Statistics, GSK Research and Development, 5 Moore Drive, Research Triangle Park, NC 27709, USA

d Office of the Chief Medical Officer, GSK Research and Development, 980 Great West Road, Brentford, Middlesex TW8 9GS, UK

*Correspondence to: Sara Hughes, Clinical Statistics, GSK Research and Development, Stockley Park West, Uxbridge, Middlesex UB11 1BT, UK. E-mail: [email protected]


Greater transparency and access to clinical trial data is currently a goal for many bodies and institutions (e.g. [1–4]). In line with this goal, GlaxoSmithKline recently made a commitment to provide access to anonymised individual patient data from clinical trials. The scope of the commitment and the approach the company is taking were described in an article by Nisen and Rockhold [5]. Providing access to individual patient data collected in clinical trials enables the conduct of further research that may advance medical science or improve patient care. In turn, this helps ensure that the data provided by research participants are used to maximum effect in the creation of new knowledge and understanding. Finally, it also enables independent review of results from clinical trials in order to validate the findings and, in so doing, strengthens trust in clinical research through enhanced openness and transparency.

Although there are clear benefits to providing greater access to individual patient level data, there are a number of aspects that need to be carefully considered. These include providing access in ways where risks to patient privacy and confidentiality are minimised and adhering to commitments made to patients via informed consent processes. It is also important to consider a framework which promotes the goal that further research conducted with individual patient level data from clinical trials is scientifically valid and the interpretations and conclusions drawn from the research are appropriate. These considerations have pointed GlaxoSmithKline towards a model where researchers submit proposals to access data which are reviewed by an independent panel and, where approved, access is provided to anonymised patient level data in a password-protected IT environment which is only accessible through a secure internet connection. This model is consistent with the recent recommendation from the UK parliamentary report on clinical trials and the subsequent Government response, which stated that anonymised patient level data should not be put in the public domain but instead accessed in secure ‘safe havens’, with independent review to ensure the proposed research makes a useful contribution to science [6,7].

The method for anonymising individual patient data prior to sharing is a critical element of the process. A robust approach is needed to minimise the risk of breaches of patient confidentiality, while retaining the scientific advantages of analyses based on individual patient data [8]. There are various privacy laws and regulatory guidance which must be followed [9,10]; there are also published proposed approaches to clinical trial data anonymisation. One publication addresses the specific situation where data will be published and therefore freely available [11]; another was proposed by an academic research organisation in support of its aim to share knowledge and data when possible [12]. The challenge is clear: anonymising data sufficiently in order to protect patient confidentiality and ensure compliance with data privacy legal requirements while retaining scientific value of the anonymised datasets.

This article describes the approach we, at GlaxoSmithKline, are taking to the anonymisation of individual patient data from clinical trials in preparation for sharing with other researchers. In outlining our approach, the intent is to generate discussion and debate with the ultimate goal of moving towards one common method for anonymisation of individual patient clinical trial data regardless of the source and sponsor of the clinical trial.

2. APPROACH

2.1. Which datasets

Access to an anonymised version of the following electronic datasets for each clinical trial (where available) is provided following approval of a research proposal by an Independent Review Panel and receipt of a signed data sharing agreement [5]:

• Raw study datasets. These are the data collected for each patient in the clinical trial, for example, system independent datasets or Clinical Data Interchange Standards Consortium (CDISC) study data tabulation model datasets.
• Analysis-ready datasets. These are the datasets used for statistical analysis, for example, analysis and reporting datasets or CDISC analysis data model datasets.

Other supporting documents are provided (protocol, statistical analysis plan, clinical study report (CSR), blank annotated case record form and data specifications). These are all redacted to remove personally identifiable information as appropriate.

Note that clinical trials of rare diseases are not listed on the clinical data request site (https://clinicalstudydatarequest.com). This is because anonymisation of these data is more difficult to achieve. For these studies, feasibility of anonymisation will be assessed as part of the review of enquiries about access to these data (the request site provides the facility to submit enquiries for any clinical trials not listed as having the data available).

2.2. General approach

The high-level approach to anonymisation involves the following:


(1) Removing personally identifiable information (PII) from the datasets (see Section 2.3 for details). This includes recoding identifiers (by replacing the original code number with a new code number), removing free text verbatim terms, replacing date of birth with year of birth or age and replacing all other dates relating to individual subjects with dummy dates.

(2) Destroying the link (code key) between the datasets that are provided and the original datasets. Some data protection authorities in Europe suggest that the data can only be considered anonymised if personal information is removed (or redacted) and the subject code number cannot be linked to a research participant [13]. Therefore, research participants’ identification code numbers are anonymised by replacing the original code number with a new code number and destroying the code key that was used to generate the new code number from the original (i.e. destroying the link between the two code numbers).


2.3. Removing personally identifiable information from the datasets

In the USA, federal privacy law (the Health Insurance Portability and Accountability Act (HIPAA)) has established methods for the de-identification of an individual’s health information. One of these methods requires the removal of 18 specific identifiers. Our approach follows this HIPAA methodology [10,14]. The 18 identifiers as defined by HIPAA are removed from the datasets (and related documentation). In addition, any other PII that may be present is removed. This approach is described at a high level in Section 2.2. Each of these steps is described in further detail in the succeeding text.

2.3.1. Recoding identifiers (or code numbers).

The following identifiers are re-coded and the code key that was used to generate the new code number from the original code number is destroyed (as described in Section 2.5) to provide the following:

• A new subject identifier (or code number) for each research participant.
• A new investigator identifier (or code number) for each investigator. The investigator name is set to blank.
• A new laboratory identifier for each laboratory.
• A new centre identifier for each centre.
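The recoding step can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper refers to SAS programs); the function names `build_recode_map` and `recode`, the field name `subjid`, the `S` prefix and the sample records are all hypothetical. The sketch assumes new codes are assigned in a random order drawn from the operating system's entropy source, so that no seed exists to be retained, and that the mapping is discarded once all datasets for the study have been recoded.

```python
import secrets


def build_recode_map(original_ids, prefix):
    """Assign each distinct original code a new code in random order."""
    unique_ids = sorted(set(original_ids))
    # SystemRandom draws from OS entropy, so there is no seed to keep or destroy.
    rng = secrets.SystemRandom()
    shuffled = unique_ids[:]
    rng.shuffle(shuffled)
    width = len(str(len(shuffled)))
    return {orig: f"{prefix}{i:0{width}d}" for i, orig in enumerate(shuffled, start=1)}


def recode(records, field, recode_map):
    """Return copies of the records with `field` replaced by its new code."""
    return [{**rec, field: recode_map[rec[field]]} for rec in records]


# Hypothetical raw and analysis-ready datasets for a single study.
raw = [{"subjid": "100234", "visit": 1}, {"subjid": "100871", "visit": 1}]
analysis = [{"subjid": "100871", "responder": True},
            {"subjid": "100234", "responder": False}]

# One mapping is built per study and applied to every dataset of that study.
key = build_recode_map([r["subjid"] for r in raw + analysis], prefix="S")
raw_anon = recode(raw, "subjid", key)
analysis_anon = recode(analysis, "subjid", key)

# Discard the code key so the new codes cannot be traced back to the originals.
del key
```

Applying a single mapping to all datasets keeps each participant's records linked under the new code, while the original codes no longer appear anywhere in the shared data.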

The same new identifiers (or code numbers) are used across all datasets applicable to a single study, for example, raw dataset and analysis-ready dataset. This includes (where applicable) pharmacokinetic datasets, genetic datasets and so on. Extension studies use the same new identifiers (or code numbers) as used for the initial study to enable individual subject data to remain linked. This also applies to long-term follow-up studies where separate reports are published. This is achieved by repeating the data anonymisation process for the initial study data at the same time as the extension/follow-up data. Note that in the CSR, subject identifiers, investigator identifiers, centre identifiers and laboratory identifiers are all redacted.

2.3.2. Removing free text verbatim terms.

Free text verbatim terms are set to ‘blank’, including the following:

• Adverse events
• Medications
• Other, for example, medical history
• Other specific verbatim free text
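As a rough illustration of setting verbatim terms to blank while leaving coded terms untouched, the following Python sketch may help. The function `blank_free_text`, its `retained_fields` parameter and all field names are hypothetical; the paper does not specify variable names or an implementation.

```python
def blank_free_text(record, verbatim_fields, retained_fields=()):
    """Blank verbatim free-text fields, except those explicitly retained."""
    cleaned = dict(record)  # work on a copy; the original record is unchanged
    for field in verbatim_fields:
        if field in cleaned and field not in retained_fields:
            cleaned[field] = ""
    return cleaned


# Hypothetical adverse event record.
ae_record = {
    "ae_verbatim": "bad headache after visiting Dr Smith",  # free text, may hold PII
    "ae_preferred_term": "Headache",                        # dictionary-coded term
    "severity": "MODERATE",                                 # value from a fixed list
}

cleaned = blank_free_text(ae_record, verbatim_fields=["ae_verbatim"])
```

Only the fields named as verbatim are blanked, so dictionary-coded and list-based values pass through unchanged.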

Certain free text fields may be retained if removal of these fields would impact the scientific value of the dataset. These fields are reviewed to ensure they do not contain PII. All dictionary-coded terms (with decode) and/or verbatim terms that use a pre-specified list are retained. Note that free text fields are typically not reported in the CSR; where they are, they are redacted. CSR patient listings and any associated free text (e.g. in appendices) are removed.

2.3.3. Replacing date of birth with year of birth.

Date of birth is replaced with year of birth with the exception of ages older than 89 years, which are aggregated into a single category of ‘90 or older’.
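The date-of-birth rule can be sketched in Python as follows. The paper does not state the date at which age is evaluated, so the `reference` argument is an assumption, as are the function names `age_in_years` and `anonymise_birth` and the example dates.

```python
from datetime import date


def age_in_years(birth, on):
    """Completed years of age on the date `on`."""
    had_birthday = (on.month, on.day) >= (birth.month, birth.day)
    return on.year - birth.year - (0 if had_birthday else 1)


def anonymise_birth(birth, reference):
    """Return the year of birth, or '90 or older' for ages above 89."""
    if age_in_years(birth, reference) > 89:
        return "90 or older"
    return str(birth.year)


reference = date(2008, 4, 1)  # assumed date at which age is evaluated
print(anonymise_birth(date(1950, 6, 15), reference))  # prints 1950
print(anonymise_birth(date(1915, 3, 2), reference))   # prints 90 or older
```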


2.3.4. Replacing all original dates relating to a research participant.

All dates related to a research subject’s participation in the clinical trial are replaced. A random offset is generated for each research participant and applied to all dates for that research participant. All original dates are replaced with the new dummy dates so that the relative times of all events/observations/interventions and so on for each research participant are retained.

Example: If the original reference date was 1 April 2008 and the date of death was 1 May 2008, a random offset is generated (e.g. 91 days). Dummy dates are then calculated using this offset of 91 days as illustrated in Table I.

2.3.5. Reviewing and removing other PII.

Any other data elements that contain PII are removed. For example:

• Any names and initials (e.g. any investigator names and any subject names and initials).
• Information from variable names (e.g. laboratory names may contain location information).
• Other geographic information such as place of work (e.g. if socioeconomic data are collected).
• Investigator comments which may identify a subject.
• Genetic data that would enable a direct trace back to an individual subject.
• Kit numbers and device numbers (e.g. container numbers and lab sample numbers).
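The date replacement described in Section 2.3.4 can be sketched in Python using the paper's own example (reference date 1 April 2008, date of death 1 May 2008, offset of 91 days). The function names `random_offset` and `shift_dates` are hypothetical, and the offset range of 1 to 365 days and the forward direction of the shift are assumptions; the paper does not state how the offset is bounded.

```python
import secrets
from datetime import date, timedelta


def random_offset(max_days=365):
    """Draw one offset in days for a participant (range is an assumption)."""
    return secrets.randbelow(max_days) + 1  # integer from 1 to max_days


def shift_dates(dates, offset_days):
    """Apply the same offset to every date belonging to one participant."""
    return {name: d + timedelta(days=offset_days) for name, d in dates.items()}


# The worked example from the text, with the offset fixed at 91 days.
original = {"reference": date(2008, 4, 1), "death": date(2008, 5, 1)}
shifted = shift_dates(original, 91)

print(shifted["reference"])                     # 2008-07-01
print(shifted["death"])                         # 2008-07-31
print(shifted["death"] - shifted["reference"])  # 30 days, 0:00:00
```

Because one offset is applied to all of a participant's dates, the interval between any two events (30 days here) is the same before and after the shift.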

Table II illustrates a fictitious example of how some of the aforementioned steps are applied.

2.4. Review and quality control

A final review of the HIPAA 18 identifiers is made to determine if further removal is required. Quality control checks are conducted for the processing of the data and supportive metadata documentation.

2.5. Destroying the link between anonymised and original datasets

Research participants’ identification code numbers are anonymised by replacing the original code number with a new code number (as described in Section 2.3) and destroying the code key that was used to generate the new code number from the original (i.e. destroying the link between the two code numbers). The following specific items are discarded:

• Any transactional copies of anonymised datasets.
• De-identification tables (links for original variable and new anonymised variable).
• Any quality control output datasets.
• Any SAS (SAS Institute, Cary, NC, USA) programming log files.
• The seed utilised for random number generation.

The anonymised datasets are stored in a separate secure location to the original coded datasets.

3. DISCUSSION

The approach described in this paper is one possible method for clinical trial dataset anonymisation. Other approaches could remove or change more, or less, of the original data. As previously outlined, our approach was based on balancing the need to anonymise data sufficiently in order to protect patient confidentiality and ensure compliance with data privacy legal requirements, while retaining scientific value of the anonymised datasets.

There are instances when the approach described in this paper will result in limitations for some specific research questions. For example, data will not be available to support research questions for which actual calendar date is important (e.g. some aspects of research for seasonal illnesses such as flu or allergic rhinitis). Likewise, our approach of removing most free text from the datasets rules out possible research where this information could be of value (e.g. text mining of adverse event verbatim text). However, in these relatively early days of broad access to raw clinical trial data, we believe that this approach appropriately balances the needs of prospective researchers with the needs of clinical trial participants’ privacy. Although removing most free text could be seen as too conservative, we believe it is preferable to err on the side of privacy; the provision of specific free text fields can be discussed and, where appropriate, revisited. The reverse is not true: we cannot release PII and then, after the fact, ‘unrelease’ it.

Our approach is broadly similar to other published approaches to data anonymisation; however, we differ in some of the details. We have taken the HIPAA list of 18 identifiers and followed those guidelines where they apply. It is important to note that a number of the items on the HIPAA list do not exist in clinical trial datasets.
For data being made publicly available in an uncontrolled way, Hrynaszkiewicz and colleagues [11] recommended that in addition to direct identifiers being removed from datasets, datasets which contain three or more indirect identifiers should be reviewed by an independent researcher or ethics committee before being submitted for publication. We have not taken such a formulaic approach but have removed or modified direct and some indirect identifiers in order to reduce the risks to data privacy. For example, we have removed the direct identifiers that exist within our datasets, either by setting them to blank or, in the case of coded identifiers, by replacing them with a new randomly generated identifier and destroying the key linking those identifiers. For dates, we have followed the approach suggested by Hrynaszkiewicz et al. by assigning a randomly generated offset, which we then apply to all dates. In terms of their list of indirect identifiers, we have removed or modified identifiers that are covered by the HIPAA list, for example, locations, date of birth and most free text verbatim responses. We also do not list our studies of rare diseases for data sharing. If a specific request is made for such data, we will review the feasibility of anonymisation of the data on a case by case basis. For the other indirect identifiers, we believe that they can be retained provided other identifiers such as locational information and date of birth are removed.

In terms of the Shostak paper [12], we have followed a similar approach in terms of creating specific programming macros to enable consistent and efficient anonymisation of datasets. We have followed a similar approach to the recoding of subject identifiers. However, we preferred the approach suggested by Hrynaszkiewicz et al. for dealing with dates, that is, assigning a randomly generated offset to all dates, rather than computing a study day variable to replace all dates. We believe the offset approach is simple to implement and minimises the additional processing required to anonymise the dates.

As more clinical trial sponsors share anonymised clinical trial data, and as the research community gains experience from creating and using anonymised datasets, it is inevitable that our approach to data anonymisation will evolve. What is clear, however, is that a long-term goal should be to have common standards for data anonymisation that all trial sponsors follow. This would facilitate appropriate pooling of data across sponsors and avoid misunderstanding or misinterpretations of data pooled using different approaches to anonymisation.

Table I. Applying random offset dates for research participants.

                          Original date                                    New date
Reference date            1 April 2008    Apply random offset = 91 days    1 July 2008
Date of death             1 May 2008      Apply random offset = 91 days    31 July 2008
Relative time of death    30 days                                          30 days

Table II. An example using fictitious data to illustrate the removal of personally identifiable information.

Acknowledgements

The authors would like to acknowledge our GSK colleagues Robert Frost, Russell Brooks, Crystal Baker, Jodie Spence, Brigitte Cheuvart, Timothy Kelly, Amit Bhattacharyya, Kipp Spanbauer and Max Cherny for their various contributions to the development of the process outlined in this paper.

REFERENCES

[1] European Medicines Agency releases for public consultation its draft policy on the publication and access to clinical-trial data. Press release available from: http://www.ema.europa.eu/ema/index.jsp?curl=pages/news_and_events/news/2013/06/news_detail_001825.jsp&mid=WC0b01ac058004d5c1, last accessed 15th November 2013.

[2] Godlee F. Clinical trial data for all drugs in current use. British Medical Journal 2012; 345:7–10.

[3] European Federation of Pharmaceutical Industries and Associations & Pharmaceutical Research & Manufacturers of America. Principles for responsible data sharing, July 2013. Available from http://transparency.efpia.eu/uploads/Modules/Documents/data-sharing-prin-final.pdf, last accessed 22nd November 2013.

[4] United States Food and Drug Administration. Availability of masked and de-identified non-summary safety and efficacy data: request for comments, April 2013. Available from https://www.federalregister.gov/articles/2013/06/04/2013-13083/availability-of-masked-and-de-identified-non-summary-safety-and-efficacy-data-request-for-comments, last accessed 22nd November 2013.

[5] Nisen P, Rockhold F. Access to patient-level data from GlaxoSmithKline clinical trials. New England Journal of Medicine 2013; 369:475–478.

[6] United Kingdom House of Commons Science and Technology Committee: third report on clinical trials. Published 9 September 2013. Available from http://www.publications.parliament.uk/pa/cm201314/cmselect/cmsctech/104/10402.htm, last accessed 15th November 2013.

[7] United Kingdom Department of Health. Government response to the House of Commons Science and Technology Committee inquiry into clinical trials, November 2013. Available from https://www.gov.uk/government/publications/clinical-trials-inquiry-government-response, last accessed 22nd November 2013.

[8] Vallance P, Chalmers I. Secure use of individual patient data from clinical trials. Lancet 2013; 382:1073–1074.

[9] Article 29 Data Protection Working Party. Opinion 4/2007 on the concept of personal data, adopted 20th June 2007. 01248/07/EN WP 136. Available from http://ec.europa.eu/justice/policies/privacy/docs/wpdocs/2007/wp136_en.pdf, last accessed 15th November 2013.

[10] United States Department of Health and Human Services. Code of Federal Regulations. Title 45: public welfare, Subtitle A §164.514, 2011 Edition. Available from http://www.gpo.gov/fdsys/pkg/CFR-2011-title45-vol1/pdf/CFR-2011-title45-vol1-sec164-514.pdf, last accessed 15th November 2013.

[11] Hrynaszkiewicz I, Norton ML, et al. Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers. British Medical Journal 2010; 340:304–307.

[12] Shostak J. De-identification of Clinical Trials Data Demystified. Durham, NC: Duke Clinical Research Institute (DCRI). Available from http://www.lexjansen.com/pharmasug/2006/publichealthresearch/pr02.pdf, last accessed 15th November 2013.

[13] PRIVIREAL Privacy in Research Ethics and Law. Recommendations from PRIVIREAL to the European Commission. Available from http://www.privireal.org/content/recommendations/#Recc, last accessed 15th November 2013.

[14] United States Department of Health & Human Services. Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) privacy rule. Available from http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html, last accessed 15th November 2013.
