Article

What you see is not what you get in the PDF document format

Health Informatics Journal 17(1) 24–32 © The Author(s) 2011 Reprints and permission: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/1460458210397851 jhi.sagepub.com

Mads R. Dahl, Eivind O. Simonsen and Christian B. Høyer Aarhus University, Denmark

Abstract The sharing and storage of scientific knowledge, information and data are today mainly in digitized form, which will become the predominant means of communicating scientific work in the future. One of the best-established formats is the open standard of PDF (Portable Document Format), which is renowned for its flexibility and stability. In this article, we expose a major flaw in the format with respect to the security of confidential information, such that even organizations responsible for safeguarding and setting the standards for data management were unintentionally revealing confidential patient data. By collecting and analysing a random sample of files from a health informatics organization, we demonstrate the extent of the problem and determine its cause by code analysis of an example. In conclusion, we suggest the development of a knowledge-sharing format that does not demand expert skills for safe usage: WYSIWYS (What You See Is What You Store).

Keywords confidentiality, data security, health informatics, PDF format

Introduction The extensive accumulation of data in the information society includes an immense amount of data on individuals. In most cases, the use of the data requires an unambiguous identification of each individual. Without proper identification it will be difficult, if not impossible, to maintain relationships between individuals and companies (such as banks, employers, and phone companies), as well as between citizens and public authorities (e.g. the social security system, courts, and healthcare). Without a doubt, data about individuals’ healthcare status is sensitive and should by all means be kept confidential, while other data, such as telephone numbers, may not be considered sensitive. Persons managing confidential data, such as healthcare information, feel obliged to

Corresponding author: Mads R. Dahl, MSc PhD MI, Assistant Professor, Institute of Public Health, Section for Health Informatics, Aarhus University, Vennelyst Boulevard 6, 8000 Aarhus C, Denmark. Email: [email protected].

Downloaded from jhi.sagepub.com at WAYNE STATE UNIVERSITY on April 14, 2015


maintain confidentiality by managing the data in a secure way. However, the use of data for purposes such as research, education, and status reports implies publication of results, frequently electronically as papers, reports, or presentations, and subsequent posting of these on the Internet. The publication of results is intended, but the publisher probably does not realize that the basis for the results – the actual data – may, because of flaws in a popular document format, also be at risk of becoming available to Internet users. Exposure of healthcare data may have severe consequences to the individuals whose data are exposed. Family, jobs, or money may be lost if information about abuse of alcohol or drugs, venereal disease, or psychiatric disease becomes known to family, friends, colleagues, or employers. In a more global perspective, there may be severe costs to society if the confidentiality between patients and the healthcare system is ruined, for example, if citizens do not dare to be tested for HIV (human immunodeficiency virus) because they do not trust the results to be kept confidential. Although there is much focus on data safety, exposure of confidential information to the public may – and does – happen. This is sometimes caused by simple negligence, but probably most often happens unknowingly. While some people are basically unaware of the risk of exposing data, others pay close attention to protecting data. Unfortunately, this may demand specialized technical insight to ensure that confidential information is actually protected. The objective of this article is to demonstrate how confidential data security may be jeopardized by the PDF file ‘multi-layer’ format. The prevalence of this problem will be quantified by analysis of a sample of PDF files downloaded from the website www.medcom.dk. Furthermore, this article has suggestions on how to avoid confidential data exposure in graphical elements of a PDF file.

Background The Portable Document Format (PDF) was developed by Adobe Systems in 1993. At this time, PDF has seized the dominant position worldwide in the distribution of documents.1 An Internet search for PDF files gives direct access to more than 1.5 billion documents (search engine: Yahoo) that are available for download to any given computer. These only represent a fraction of the actual number of files stored and exchanged. Since PDF has recently become an ISO 32000-1 standard, development and documentation is managed by the International Organization for Standardization (ISO).2 Several factors have influenced the enormous use of PDF: (1) availability, (2) price, (3) usability, (4) file size, and (5) security. PDF is independent of operating systems, as the free-of-charge Adobe Reader is available for Windows, MacOS, UNIX, and Linux. The document can be displayed as originally produced by the author, since layout, text, images, and graphical objects are maintained.3 Another advantage of PDF is the ability to create files smaller than the original format. Shrinking the original file is only possible by applying several different compression methods in the production of the final PDF document, while conserving layout and machine-searchable text. Thus, software made to produce and/or read PDF files relies on a wide range of compression and decompression filters, and encryption and decryption technologies.3 A PDF document is generated by converting the original document elements into a collection of linked objects and layers whose data represent the text and graphical elements, divided into a range of subtypes and formats.1 The indirect objects are controlled by a cross-reference table/dictionary generated at the end of the PDF file. When a PDF file is accessed for onscreen viewing, this is accomplished by applying decoding filters to the file objects in content streams for each page of the document. 
Furthermore, the content streams can include so-called operators controlling nonobject elements.1 The complex PDF structure supports several security levels, as files can be


password protected and restrictions can be made on copying, printing and altering the files.4 The key issue is that PDF has ‘multi-layer’ capabilities but the average user is under the impression that PDF is a ‘single-layer’ format. To most end-users, PDF is an extremely useful tool in daily life, and they will probably never have to consider the underlying technology, as they probably only view PDF files. However, pitfalls exist if the end-user produces PDF files based on confidential data. Confidential data take on many forms; we will focus on data that can be directly related to single individuals who can be identified in a healthcare-related context. Every single citizen in Denmark is assigned an identification number in the Central Population Registry (CPR) at birth. The number is unique and can never be changed or reused; the CPR number is an unambiguous identification of one – and only one – person.5 The CPR number consists of 10 digits: the first six represent the date of birth (day, month, and year, two digits each), and these are followed by a serial number of four digits. The serial number is generated by an algorithm that makes it possible to verify whether the number is genuine, gives the century, and identifies the person’s gender (even numbers are assigned to women). For example, the (made-up) number 101010–1010 refers to a woman born on 10 October 1910. 
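The structure of the CPR number described above can be illustrated with a minimal sketch. It decodes only the date of birth and the gender. The rule that maps the seventh digit to a century is not given in this article; the mapping used below is the commonly documented one and should be read as an assumption, as should the function name `parse_cpr`. The validity check mentioned in the text is deliberately left out.

```python
from datetime import date


def parse_cpr(cpr: str) -> dict:
    """Decode date of birth and gender from a 10-digit CPR number."""
    digits = cpr.replace("-", "").replace("–", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("a CPR number must contain exactly 10 digits")

    day, month, yy = int(digits[0:2]), int(digits[2:4]), int(digits[4:6])

    # Assumption (not stated in the article): the seventh digit,
    # together with the two-digit year, selects the century.
    seventh = int(digits[6])
    if seventh <= 3:
        century = 1900
    elif seventh in (4, 9):
        century = 2000 if yy <= 36 else 1900
    else:  # 5, 6, 7 or 8
        century = 2000 if yy <= 57 else 1800

    # From the article: even numbers are assigned to women.
    gender = "female" if int(digits[9]) % 2 == 0 else "male"
    return {"birth_date": date(century + yy, month, day), "gender": gender}


# The made-up number used in the text.
print(parse_cpr("101010-1010"))
```

For the made-up number in the text, the sketch yields 10 October 1910 and a female holder, in agreement with the description above.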
Today, the CPR number is the data index key to virtually all public services and registrations, and is essential for the success of research conducted on the various Danish registry databases.6 Due to the meticulous registration of the Danish population, healthcare research has a wealth of data to draw on, as the entire population can be coupled on an individual level to information in several public registries;7,8 it has even been postulated that ‘the entire country is a cohort’.9 The CPR number is to be considered as confidential personal information, and all relevant research has to be approved by the Danish Data Protection Agency. Clinical and epidemiological research on topics like birth, death, disease, fertility, and emigration/health data are only a few examples of the beneficial effects of the system.10–13 The Danish Medcom organization was established in 1995 as a collaboration of state organizations responsible for development, testing, implementation, and quality assurance of electronic communications within the Danish healthcare sector and electronic healthcare systems abroad. The purpose of Medcom is to set the standards for electronic communication that supports continuity in healthcare.14 Thus, Medcom is not directly involved in the hosting, operating or communicating of digital information between healthcare sectors. Medcom has pioneered the implementation of health informatics in Denmark and participated in a wide range of international activities. Medcom therefore serves as a role model and as the single most competent organization in Denmark when it comes to standardizing the handling of personal and confidential data.15 Thus, if Medcom unintentionally exposes confidential information through published presentations and documents in PDF files, other organizations or individuals can potentially make the same mistake. 
It is often advocated that converting primary document formats to Portable Document Format (PDF, a secondary format) is a secure way to strip documents of hidden data before publication. The world has seen the consequences of embedded information in PDF files from even military16 and government17 sources, but computer users are left with the impression that PDF is the safe standard to use for publication,18 and many institutions and organizations now use PDF as their publication standard.19 During our general research into confidentiality risks, we have encountered many examples of hidden data in PDF files. Similar investigations have outlined the types of metadata that may be extracted from electronic publications23 and developed scanning tools for hidden personally identifiable information in documents.24


Because of its role in setting standards for the electronic communication of healthcare data in the Danish healthcare system, Medcom was chosen as the test organization for this investigation. We emphasize, however, that none of the organizations we have analysed appears to be aware of security problems in their use of PDF.

Material and methods The prevalence of PDF files with confidential information, such as CPR numbers, was estimated by downloading a sample of files from the website http://www.medcom.dk/ for analysis. We illustrate the PDF problem using a publication from a leading journal publisher20 that had a wide range of objects encoded in different streams: a clearly unintended breach of confidentiality. We used Yahoo.com as the search engine. Yahoo provides certain web services for application developers to use in their own code. For this project, we used the web search service V1/webSearch.html, which allowed us to perform a search via the Yahoo search engine and receive a response in a file index format, which can be structured for multiple download sequences. Two parameters are required when using the web search service: appid, the name chosen when registering to use the service, and query, the search string. The query string was the character ‘1’, to ensure that we obtained the greatest diversity of files; furthermore, the file search was restricted to the domain www.medcom.dk. Files were downloaded twice, on 28 October 2008 and on 28 December 2008, and subsequently sorted by file type as well as time of upload to the web server. By visual inspection (using Adobe Reader), PDF files were sorted into two groups by content: (1) files containing only text and/or simple graphics such as logos or line drawings, and (2) documents containing graphics and/or pictures. By manual review, the PDF files from the second group (files including graphics and/or pictures) were categorized into three subgroups according to their CPR number content: (1) none, (2) fictional CPR numbers, and (3) suspected genuine CPR numbers and names. To analyse the loading and displaying of the PDF onscreen, a high-speed camera was used to document the sequence in which each element was displayed.
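Sorting the downloaded files by type, as described above, can be sketched in a few lines. This is an illustration only and not the procedure actually used: the file names are hypothetical, and the mapping from file extension to type (including treating `.htm` as HTML) is an assumption.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed mapping from file extension to the file types reported in this study.
KNOWN_TYPES = {
    ".pdf": "PDF",
    ".xls": "XLS",
    ".doc": "DOC",
    ".html": "HTML",
    ".htm": "HTML",
    ".ppt": "PPT",
    ".zip": "ZIP",
}


def file_type(name: str) -> str:
    """Classify a file by its extension; anything unrecognized is 'Unknown'."""
    suffix = PurePosixPath(name).suffix.lower()
    return KNOWN_TYPES.get(suffix, "Unknown")


def tally_types(names) -> Counter:
    """Count how many files fall into each type."""
    return Counter(file_type(name) for name in names)


# Hypothetical file names, for illustration only.
sample = ["report.PDF", "slides.ppt", "index.htm", "data.xls", "README"]
print(tally_types(sample))
```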

Software and hardware PDF files were accessed using Adobe Acrobat 9 Pro (Adobe Inc.) and Adobe Reader 9 (Adobe Inc.). Graphical elements copied from PDF files were pasted into PowerPoint 2003 (Microsoft) for demasking. Digital images were cropped and rendered using Adobe Photoshop CS3 (Adobe Inc.) for PC. High-speed images were acquired using a Casio Exilim EX-FH20 camera with a burst rate of 40 pictures per second, capturing images as 3072 × 2304 pixels, JPEG format, with autofocus, auto balance, ISO 200, and F/3.9. The camera was mounted on a tripod (for stability) and photographs were taken in room light. This study was approved by the Danish Data Protection Agency. In order to view and analyse the code of the PDF file, one needs to open it in a suitable text editor. In our case, we used the free cross-platform Vi IMproved version 7.0.237 (www.vim.org) on a Red Hat Linux distribution. In a pure Windows environment, alternatives like Notepad++ (http://notepad-plus.sourceforge.net/uk/site.htm) are also options. Opening a PDF file in a text editor gave a view of the code behind the PDF file. In this case, page 5 of the PDF file20 contains 14502 lines of code, and reading those lines directly can be quite cumbersome, so applying some decompression can be useful. The decompression made the file slightly larger – in this case by 2987 lines – but it also enabled navigation of the file itself, because the decompression process also divides the


PDF code into pages. For this purpose, we used a free Java tool, Multivalent20060102.jar (http://multivalent.sourceforge.net/), to decompress the PDF file. Running this Java ARchive (JAR) file requires the Java runtime environment to be installed.
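The decompression step can also be sketched without a dedicated tool. The example below is not the Multivalent procedure used in this study; it swaps in Python's standard `zlib` module and assumes that the streams of interest were compressed with the Flate filter only. The function name `inflate_streams` and the tiny hand-made object are illustrative, and the simple pattern match would miss edge cases that a real PDF parser handles.

```python
import re
import zlib

# Everything between the 'stream' and 'endstream' keywords of an object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list:
    """Return the content of each stream, inflated where that is possible."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes after the data.
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate-compressed (or another filter was used): keep as-is.
            results.append(raw)
    return results


# A hand-made object with one compressed content stream, for illustration.
content = b"q 200 0 0 150 50 600 cm /Im54 Do Q"
fake_object = (
    b"14 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)
print(inflate_streams(fake_object))
```

Once inflated, the content stream is plain text, and operators such as `Do` can be located with an ordinary text search.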

Results A total of 2426 files were identified and downloaded. The documents had been accessible on the Internet in the period between 27 March 2001 and 28 October 2008. The distribution of file types and the sizes of files are given in Table 1. Of the 1484 PDF files downloaded, six were damaged and could not be analysed. Of the remaining 1478 undamaged files, 296 (20%) included screen dumps, graphics, and/or pictures, whereas the remaining 1182 files (80%) included only text and/or simple graphics (logos or line drawings). Based on the individual file analysis, the 296 PDF files containing graphical elements were categorized by the presence of CPR numbers. The majority (182 files, or 61%) did not include CPR numbers or included only fictional CPR numbers (101 files, or 34%). Suspected genuine names and CPR numbers were found in 13 files (4%). A total of 46 individuals could be identified by scrutiny of the 13 files (with between 1 and 31 CPR numbers in each file). In addition to name and CPR number, a range of other information on individuals was found: addresses, bank account details, and notes from electronic patient records, including information about diagnoses. A t-test of mean file size showed a significant difference (p < 0.05) between the group of files with CPR numbers (13 files, mean 1999 KB) and the group not containing CPR numbers (1465 files, mean 312 KB). The mean file size of the group of files with names and CPR numbers (13 files, mean 1999 KB) also differed significantly (p < 0.05, t-test) from that of the files containing graphical elements but no (or just fictional) CPR numbers (283 files, mean 904 KB). The journal PDF file we analysed20 was selected based on a manual analysis that determined the file to contain text, pictures, graphics and confidential information that could be traced back to an individual patient.
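The percentages reported above follow directly from the file counts. As a quick arithmetic check, using only figures stated in the text and in Table 1 (the helper name `share` is illustrative):

```python
def share(part: int, whole: int) -> int:
    """Percentage of `whole` made up by `part`, rounded to a whole number."""
    return round(100 * part / whole)


undamaged = 1484 - 6            # PDF files that could be analysed
with_graphics = 296             # files with screen dumps, graphics or pictures
text_only = 1182                # files with text and/or simple graphics only
no_cpr, fictional_cpr, genuine_cpr = 182, 101, 13

# The subgroups add up to their parent groups.
assert with_graphics + text_only == undamaged
assert no_cpr + fictional_cpr + genuine_cpr == with_graphics

print(share(with_graphics, undamaged), share(text_only, undamaged))
print(share(no_cpr, with_graphics),
      share(fictional_cpr, with_graphics),
      share(genuine_cpr, with_graphics))

# File counts per type in Table 1 add up to the 2426 downloaded files.
assert sum([1484, 161, 308, 156, 13, 27, 277]) == 2426
```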
We isolated page 109 of the document and saved it separately for the picture recording and subsequent analysis. The PDF file was chosen as a general example of journal publications. The sequence of loading and displaying the PDF file in Adobe Reader on an LCD screen is shown in Figure 1a–e. The result of the loading sequence displayed in Figure 1 illustrates the

Table 1.  Distribution of the 2426 files downloaded from http://www.medcom.dk/. The table shows the distribution of the files found according to file type. The table also shows the size of folders.

File type   Number   Folder size (MB)   Average size (KB)   Max. size (KB)   Min. size (KB)
PDF          1,484                467                 315           20,471                0
XLS            161                 22                 134            5,852                0
DOC            308                 34                 111            6,805               12
HTML           156                  0                   2              103                0
PPT             13                 19               1,485            6,199               82
ZIP             27                 45               1,667           20,451                4
Unknown        277                 27                  98           15,144                1


Figure 1.  Sequence of loading and displaying a PDF journal page in Adobe Acrobat Professional. In picture (a), the Adobe Acrobat program has just begun after double-clicking on the PDF file, and in (b), the first page layout begins processing. The different types of (c) text and pictures are then loaded, followed by (d) the display of simple graphics. Finally, (e) the page thumbnail on the left hand side is generated. (The eyes have been blurred in (c) to preserve anonymity).

problem of ensuring confidentiality in PDF. The identity of the masked individual is compromised in pictures 1b and 1c because of the way that the final image is rendered in PDF. Using the ‘image tool’ in Adobe Reader, it is also possible to copy a selected image from a PDF file. The tool is set to copy only the picture and not the inserted graphics used to mask an area. Thus, by making a copy of an image from a PDF file with overlaid graphics and pasting it into another software program, confidential information can be seen. This result can also be obtained by analysing the code for the document page, comprising 17,488 lines of code. The PDF file in Figure 1 has the image in question on page 5 of the paper.20 Opening the PDF file in the text editor, one can search for ‘page 5’ and locate the following code (on one line; here, the code is formatted for legibility):

…
(Line 1767)  % page 5
(Line 1768)  13 0 obj
(Line 1778)  /ExtGState >
(Line 1779)  /ColorSpace >>>
(Line 1780)  /Parent 103 0 R>>
(Line 1781)  endobj
…

This code fragment tells us that page 5 has five external objects that correspond with the five images on the page. The images themselves, named /Im54 to /Im58, are defined by an indirect reference to objects 15 0 to 19 0. To get a view of the full content on page 5, one has to look at the indirect reference 14 0, which is what the Contents keyword refers to. The stream object is complex and lengthy; thus only a portion is shown in the fragment. The Do operation of the external images /Im54 and /Im56 is invoked between lines 1767 and 1781 in the code for the PDF file (Figure 1b and c). The masking of the eyes (Figure 1d) is invoked in coding line 10135 to 10138 (data not shown). A content stream in a PDF file is handled in a sequential manner, and therefore the images will be placed onto the page before the rectangle that masks the eyes on one of the images. This is all done in a matter of milliseconds, but using a high-speed camera it is possible to capture the different image layers while a PDF file is rendered, and therefore it is also possible to get a clear view of the image before the eyes are masked out. The page thumbnail appearing in Figure 1e is coded separately (data not shown). The images on page 5 of the document are all defined as XObjects. This is because the images are of a certain nature and therefore will be treated as external objects to the file. XObjects are one of five possible types of graphical objects within a PDF file, the other four being path objects, text objects, inline images, and shading objects (8.2). The mask of one of the images on page 5 will be made into a path object and therefore is not a part of the XObject itself.
This is obvious when the process of rendering the PDF is captured using a high-speed camera (Figure 1) because the objects in the content stream are rendered onscreen sequentially; we can observe that the rendering of the image itself and the masking of the image do not take place at the same time. The fact that the masking of the image is an individual path object within the PDF file is also the reason why it is possible to copy the image directly from the PDF file and into a picture editor, where one is able to see the picture as it was before the eyes were masked out. The copying procedure will only copy the image XObject and not the path object of the mask.
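The sequential nature of a content stream makes this kind of masking detectable in principle. The following is a minimal sketch, not a tool used in this study: it walks through a decompressed content stream and reports every rectangle (`re` operator) that is appended after an image XObject has been painted (`Do` operator). The function name, the sample stream and its coordinates are hypothetical, and the whitespace tokenizer is a simplification that would break on string operands containing spaces.

```python
def is_operator(token: str) -> bool:
    """Operators start with a letter (or are ' or "); operands do not."""
    return token[0].isalpha() or token in {"'", '"'}


def find_overlay_rectangles(content_stream: str) -> list:
    """Return (image_name, rectangle) pairs for rectangles drawn after an image."""
    painted = []    # names of the XObjects painted so far
    operands = []   # operands collected since the last operator
    findings = []
    for token in content_stream.split():
        if not is_operator(token):
            operands.append(token)
            continue
        if token == "Do" and operands and operands[-1].startswith("/"):
            painted.append(operands[-1])
        elif token == "re" and painted and len(operands) >= 4:
            rect = tuple(float(value) for value in operands[-4:])
            findings.append((painted[-1], rect))
        operands = []   # each operator consumes its operands
    return findings


# Hypothetical content stream: an image is painted, then a filled black
# rectangle (x, y, width, height) is drawn on top of it.
sample_stream = """
q 200 0 0 150 50 600 cm /Im54 Do Q
0 0 0 rg
90 690 80 12 re f
"""
print(find_overlay_rectangles(sample_stream))
```

A finding does not prove that confidential content is hidden, since rectangles are also used for ordinary layout, but it marks pages where an image and a later overlay should be inspected manually.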

Discussion and conclusions In this article we have demonstrated a significant security problem in the use of PDF as a standard for publishing and archiving healthcare documents. We found that 13 out of 1484 PDF files contained the most confidential type of information in Denmark: the CPR number of a named individual. We also found a statistically significant variation in PDF file size between files containing and those not containing hidden confidential information that could be used as an indicator or alert for the author or publisher. Similar to the security problem in Microsoft PowerPoint with the embedding of OLE objects,21 we have shown how a document author can unintentionally compromise information security. Other studies have shown that end-users can have problems with well-known applications due to poor software design.22,25 The general user’s conception of PDF as a single-layer ‘digital photocopy’ of the original document has not progressed in line with developments in the format into a ‘multi-layer’ and diverse file format.


Our investigation shows that even expert organizations and publishers can unintentionally expose confidential information embedded in PDF files. Password protection, printing and copy restrictions etc. can all be evaded. This type of user-generated security risk is a consequence of poor usability of electronic document applications and tools. The security of confidential information used in preparing electronic documents can only be guaranteed when the applications used to generate these documents make it easy for their users to incorporate such security and to understand what the risks of using different formats might be. Denmark’s renown in registry research, based on the availability of the CPR number as a key cross-referencing tool, has been jeopardized by software developers’ lack of attention to usability. The modern healthcare sector is very much dependent on IT and health informatics, but very few healthcare workers have had any form of IT education. Healthcare professionals and researchers with access to confidential information must be able to use the technologies and applications available without having to understand how each pixel will be coded in the final file shared in the interests of us all. Other sectors have different types of confidential information that could also be at risk of exposure. The majority of compromising PDF files were PowerPoint presentations converted to PDF in line with the recommendations from the Danish Data Agency. Presentations often include pictures and screenshots from authentic applications and thus these publications should be treated with great circumspection. Multilayer images, compromising pictures, screenshots and similar illustrations should be carefully edited for confidential data and grouped prior to a novel rendering in a single-layer image format. 
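The difference between a layered representation and a single-layer rendering can be made concrete with a toy raster. The sketch below is purely illustrative: the image is a small grid of numbers rather than a real picture, and the names `flatten`, `layered` and `mask` are hypothetical. In the layered form the masked pixels remain stored; after flattening they are overwritten and cannot be recovered.

```python
def flatten(image, mask_rect, fill=0):
    """Return a new single-layer raster with the mask painted into the pixels."""
    x0, y0, width, height = mask_rect
    flat = [row[:] for row in image]          # copy, leave the original intact
    for y in range(y0, min(y0 + height, len(flat))):
        for x in range(x0, min(x0 + width, len(flat[y]))):
            flat[y][x] = fill
    return flat


# Layered form: the image and its mask are stored as separate objects.
layered = {
    "image": [[7, 7, 7, 7],
              [7, 9, 9, 7],      # the 9s stand for the confidential region
              [7, 7, 7, 7]],
    "mask": (1, 1, 2, 1),        # x, y, width, height
}

flat = flatten(layered["image"], layered["mask"])

print(layered["image"][1])   # confidential values are still stored
print(flat[1])               # confidential values are gone
```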
Alternatively, the entire presentation can be saved as single layered images and subsequently inserted in a new PowerPoint using the photo album function26 before conversion to PDF. The acronym WYSIWYG (What You See Is What You Get) has been a backbone of document format development. We suggest development of standards with the acronym WYSIWYS (What You See Is What You Store). In the case of PDF files, this standard should be the default setting, thus ensuring that the document author does not inadvertently distribute confidential embedded information. In general, this type of document security should be incorporated in the application, without relying on user competence. If professionals who work with IT daily can make mistakes and publish documents containing confidential and sensitive data, general end-users cannot reasonably be expected to be aware of the very specific technical details and subtle settings that influence the final result. In Denmark the end-user is, by law, responsible for the content of documents, but we recommend that the software companies should have the responsibility for the security of the technology. This would give the application development companies an incentive to produce more user-friendly – and hence more secure – tools.

Acknowledgments Peter Laursen and Søren Forchhammer.

Declaration of Interests None.

References
1. Adobe Systems Incorporated. Document Management: Portable Document Format, Part 1. 1st edn. PDF 1.7 PDF 32000–1:2008.
2. Lazarte M. PDF format becomes ISO standard. ISO 32000–1:2008.
3. Ockerbloom JM. Archiving and preserving PDF files. RLG DigiNews 2001; 5 (1): 1–6.
4. Adobe PDF Security: Understanding and Using Security Features with Adobe Reader and Adobe Acrobat. Whitepaper, 2005.
5. Central Office of Civil Registration. The Civil Registration System in Denmark. Webpage 27 September 2001, accessed 11 April 2008.
6. Hallas J. Conducting pharmacoepidemiologic research in Denmark. J Pharmacoepi Drug Safety 2001; 10: 619–623.
7. Mortensen PB. Registerforskning i Danmark. Norsk Epidemiologi 2004; 14 (1): 21–124.
8. Pedersen CB, Gøtzsche H, Møller JO, and Mortensen PB. The Danish Civil Registration System: a cohort of eight million persons. Dan Med Bull 2006; 53 (4): 441–449.
9. Frank L. Epidemiology: when an entire country is a cohort. Science 2000; 287: 2398–2399.
10. Li J, Laursen TM, Precht DH, Olsen J, and Mortensen PB. Hospitalization for mental illness among parents after the death of a child. N Engl J Med 2005; 352: 1190–1196.
11. Mortensen PB, Pedersen CB, Westergaard T, Wohlfahrt J, Ewald H, and Mors O et al. Effects of family history and place and season of birth on the risk of schizophrenia. N Engl J Med 1999; 340: 603–608.
12. Knudsen LB. The Danish Fertility Database. Dan Med Bull 1998; 45: 221–225.
13. Poulsen S, Rønne T, Kok-Jensen A, Bauer JO, and Miörner H. Tuberculosis in Denmark 1972–1996. Ugeskr Laeger. 1999; 161 (23): 3452–3457.
14. MedCom Status report. On the threshold of a healthcare IT system for a new era. Status report, MedCom 5, MC-S212 December 2007, accessed 2 April 2009.
15. MedCom, Denmark:1 Danish Health Data Network. Project story. eHealth Impact 7.7 DG INFSO 2006.
16. Weisman R. Military’s snafu highlights PDF redaction. PDFzone, May 2005. http://www.pdfzone.com/c/a/utilities/militarys-snafu-highlights-pdf-redaction
17. Poulsen K. Justice e-censorship gaffe sparks controversy. SecurityFocus October 2003. http://www.securityfocus.com/news/7272.
18. Johnson D. The PDF prescription for health care cost control. Appligent Document Solutions, August 2009. http://www.appligent.com/talkingpdf-thepdfprescriptionforhealthcare.
19. Ward M. The hidden dangers of documents. BBC News, August 2003. http://news.bbc.co.uk/2/hi/technology/3154479.stm.
20. Adams JE. Dual-energy X-ray absorptiometry. Medical Radiology: Radiology of Osteoporosis. 2nd edn. Springer, 2008, pp. 105–124.
21. Dahl MR, and Høyer CB. Professional PowerPoint presentations can compromise data security. J Inform Ass Sec 2009; 4: 42–47.
22. Furnell S, Jusoh A, Katsabas D, and Dowland P. Considering the usability of end-user security software. IFIP International Federation for Information, Proceedings of 21st IFIP International Information Security Conference, Karlstad, Sweden, 2006; 201: 307–316.
23. Forrester J, and Irwin B. An investigation into unintentional information leakage through electronic publication. Information Security South Africa, 2005.
24. Aura T, Kuhn TA, and Roe M. Scanning electronic documents for personally identifiable information. ACM WPES 2006.
25. Byers S. Information leakage caused by hidden data in published documents. IEEE Sec Privacy Mag 2004; 2 (2): 23–27.
26. Yam CS. Removing hidden patient data from digital images in PowerPoint. Am J Roentgenol 2005; 185: 1659–1662.
