Date: Wed, 12 Oct 94 13:05:59 EST
Errors-To: Comp-privacy Error Handler
From: Computer Privacy Digest Moderator
To: Comp-privacy@uwm.edu
Subject: Computer Privacy Digest V5#047

Computer Privacy Digest Wed, 12 Oct 94    Volume 5 : Issue: 047

Today's Topics:    Moderator: Leonard P. Levine

    Responses to Medical Data Security Questions

----------------------------------------------------------------------

From: Richard Goldstein
Date: Wed, 12 Oct 1994 09:13:52 -0700 (PDT)
Subject: Responses to Medical Data Security Questions

Following is a summary of the responses I received to my request for help on the issue of allowing an outside University-based computer research group access to an HMO's medical record; responses were received from one person who wished me not to post their name and from Tom Lincoln, Vicki Rosenzweig, Rogier Wolff, Richard Threadgill, Matthew Elvey, Bill Ellett, Jeff Hupp, Mark Durst, Carl Ellison, David Stodolsky, Grant Grundler, Peter Sherwood, and David Harvey. My sincerest thanks to all for your responses, which were very helpful. In a few cases, I have added a comment of my own to the summary; these added comments appear inside curly brackets ("{ }").

General responses not necessarily tied to any particular part of my original note:

1. Systems operators of the computer system will almost surely have access to any information contained on the system. This group probably includes student operators. You might want to express concern to the university about how they protect your data from these employees. It is, however, likely that operators will have no interest in your data, and may well be unaware of exactly what it is.

2. You probably want an advance list and complete veto power over exactly who the "researchers designated as members of the [HMO] project" are. Also, you may want a written statement from each of them stating that they are aware that the records may contain confidential information, and that they will respect that confidentiality.
Also, I assume that the university researchers will include faculty. This probably means that they will consider their student assistants to be extensions of themselves, and that they will give complete access to their records (i.e. your records) to those assistants. They will have faith in those students' confidentiality, but do you? {It turns out that there are faculty on the University team so this was very helpful.} You may also want to control how the information is used by professors in classes or papers. You may also want to limit whether they can say that the data on which they based their study came from your HMO, by name (or maybe require that they give you credit).

3. You will want to know what computers have access to your data. Is the data put onto disks on the main university computer system, or is it put onto PCs in professors' offices? If the former, then university hackers may gain access. If the latter, then professors may not safeguard access as thoroughly as you might wish.

4. When and for how long will your data be mounted on the system? Frequently, data may be loaded onto a system, and at the conclusion of the project, unless it is a large file, may be "forgotten". Will it be removed? Once it has been removed, who will have access to the tapes? Will the university retain a copy of the tapes, or will all copies be either returned to you or destroyed?

5. You say that the university will be provided with sample records where the records have been sanitized. Who is doing the "sanitizing"? The HMO staff? The university researchers themselves, or others? {Currently the HMO staff is doing the "sanitizing".} I couldn't tell if they want your records or if you want their help. If you are doing them the favor, then you might suggest that the university provide programs to be run on YOUR computer which will sanitize the records before they leave your control.
Otherwise, you want to be very clear on exactly who has access to the records before they are sanitized, and on exactly what happens to ALL backup tapes and other electronic copies, and all paper copies of the original records. The tapes should be specifically erased, not simply re-used. {Apologies for the unclarity--the HMO wants the University's help, but also the University wants our records.} If done at the university, you may want to require that the sanitizing not be done on the university's main computer system, but instead on a stand-alone system. You are obviously very concerned about any access to the original records.

Also, if the sanitizing is not being done by HMO staff, you may want to require that a random sample of, say, 100 "sanitized" records be provided to you on paper so that you can review whether the sanitizing actually was complete. You may also want to require that the records not be loaded onto the system for the researchers until you have had the opportunity to review the sample, and until you have approved that the sanitizing was complete. Also require that if at any point the HMO comes to believe that the records are not sufficiently sanitized, for any reason, then the university will "immediately" remove the data from the system and will stop all access until you again agree that it has been sufficiently cleaned up.

6. As you say, you may have problems with names which appear in the text portions of the records. One way to search for these would be to
   a. take each word in the patron's name
   b. look throughout the text for any occurrence of this text.
   c. Remove the matching text, perhaps replacing it with "XXXXX".
A problem is that you may remove some non-name text. For example, if the name is "Rich Goldstein", then you'd remove "rich" from a sentence like "rich and rosy cheeks". But you certainly don't want to manually review every record, unless they are very few in number.
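
Steps a-c above can be sketched in a few lines of Python. This is only a minimal illustration, not the HMO's or the university's actual masking program; the function name mask_name and the "XXXXX" default are assumptions made for the example. It matches whole words case-insensitively, so it shows the over-masking problem described above ("rich" disappears from ordinary text), and it does nothing about misspelled names.

```python
import re


def mask_name(text, full_name, mask="XXXXX"):
    """Replace each whole-word occurrence of any segment of full_name.

    Hypothetical helper for illustration only. Matching is
    case-insensitive, so ordinary words that happen to equal a name
    segment (e.g. "rich") are masked as well.
    """
    segments = [s for s in full_name.split() if s]
    if not segments:
        return text
    # \b limits matches to whole words; re.escape guards against
    # punctuation inside a name segment (e.g. "O'Neil").
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in segments) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(mask, text)


record = "Rich Goldstein presented with rich and rosy cheeks."
print(mask_name(record, "Rich Goldstein"))
# prints: XXXXX XXXXX presented with XXXXX and rosy cheeks.
```

Because the match is on whole words, "enrich" would be left alone, but a keying error such as "Goldstien" would also slip through unmasked.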
{They are potentially voluminous--certainly not "very few in number".}

7. Since they are looking to provide access, I assume that the data will be indexed. To verify that the data has been sanitized, you might search for common names such as "smith" and "bill", and verify that any occurrences are appropriate. You may want to require that once they get some simple indexes built on the data, you will be provided an opportunity to review the data again to ensure that it is sanitized. At that point, you could do these searches.

8. This is probably obvious, but: ask them whether the system administrator for this computer is a member of the project. Ask about system staff in general--they are likely to have access to all sorts of information. Ask about backup procedures: are tapes left around in unlocked drawers, or taken offsite to unsecured locations? (At my site, the offsite backup is my boss's living room, but we have no secure data.) Is the system connected to any kind of network, and what operating system is it on? For that matter, who designates members of the project?

9. One of the rules I'd propose would be to consider the data provided to the university as confidential, and under strict orders not to be proliferated, as if it were the real data. Thus the rules for the researchers should be the same as for the doctors that need to work with the data. Of course this is not completely feasible, but it should give a nice guideline to work towards....

10. If you have time, I would recommend that rather than sanitizing some large number of records from your existing database, you *manufacture* a set of plausible records entirely. While the university group is likely to assume that all of these records pertain to real patients somewhere, that doesn't need to be the case.
I would particularly recommend this because even if the university's security precautions are strong (which I'm willing to accept), the researchers are almost certainly going to discuss amongst themselves any particularly entertaining cases they come upon in the course of developing and testing their technology. I strongly doubt that any of your patients would like to become urban legends, even without their real names attached.

In the (likely) event that you can't generate an entirely fictitious set of data for them to work with, I'd guess that you can probably place a high degree of faith in the automatic masking process, but I'd recommend that you only give them a large subset of your patient data. Ideally, you'd add to the auto-mask a step that strips out entirely any records which contain the strings 'Mr.', 'Mrs.', or 'Ms.' - I think that will dramatically reduce the number of real proper names which end up in the data set you hand them.

12. If it's a networked Unix computer, it's pretty much guaranteed insecure.

13. I don't have a lot of original material to add, but your comments don't make it clear whether you have a copy of:

    "Report on Statistical Disclosure Limitation Methodology", Subcommittee on Disclosure Limitation Methodology, Federal Committee on Statistical Methodology. Statistical Policy Working Paper 22 of the Statistical Policy Office, Office of Information and Regulatory Affairs, Office of Management and Budget. NTIS Document Sales, PB94-165305.

Much of the Census work on this question revolves around selecting summaries that are not too revealing, with only the summaries being released to the public. By contrast, this report includes considerable material on disclosure risk in microdata. In addition, an Appendix entitled "Research Agenda" should make it quite clear what is NOT known in this area.

In your place I would inquire why the University must have genuine data records, even if masked.
In a similar case here at LBL we scrambled each record so as to have the right marginals, but so that no output record was precisely the set of attributes from an input record. Is the search software under consideration so sophisticated that it could not be developed using such scrambled records?

You are also right to look closely into the university computer's access control. If they do not have an active program to deter unauthorized access (e.g., running "crack" programs against passwords, having automatic timeouts on user terminals) and to detect it (active monitoring by a human of system logs and accounting to spot unusual patterns, hopefully with pieces different from those supplied by the system manufacturer), they should not be trusted with any confidential data.

14. Is it an option to have the University ask you for statistical information and for you to sell them statistical results? This sounds like a source of income for you and a masking like that used for the Census. {This comment, the only one in one particular response, raises more questions than it answers.}

15. I will first address your questions on the providing of HMO data to the University *strictly* on a computer basis. No matter how much they say the data is off limits to all but approved personnel, I just can't believe they have all of the necessary tools to restrict access. A university by definition is an *open* institution. Rather than moving the data off-site, I would be inclined to allow network hookup to specified machines with Kerberos encryption. That way, if anything goes awry, you can shut down the machine which is providing the information at your site. It goes without saying that almost any querying of the data on that machine *MUST* be logged. Any attempts at illegal access to the data should be followed up by swift and appropriate legal action.

16. As for the ability to combine the information to obtain a complete profile of the person, you have noted it is indeed possible.
No matter how much you automate the stripping process, I am inclined to believe the information can't always be totally removed. Thus the confidentiality of the doctor-patient privilege is compromised. One rule I would have is that no patient's information should be made available without the consent of the patient. By the way, I would *NOT* sign such a consent form! But then I also have caller ID on my phone, and do other things to monitor when and how people can have access to me.

Responses tied to particular parts of the original posting:

>I am not aware of any literature dealing specifically with this question
>for medical records (except that I do have a copy of the 9/93 publication
>from the Office of Technology Assessment entitled _Protecting Privacy in
>Computerized Medical Information_; however, this is not a technical
>publication).

Your best and most recent overall review is "Health Data in the Information Age: Use, Disclosure and Privacy" by Molla S. Donaldson and Kathleen N. Lohr, Editors, The Institute of Medicine, 1994, ISBN 0-309-04995-4. It is an even-handed discussion, fully documented, with an extensive literature.

By and large, access to well secured individual records is gained by confidence procedures from individuals who have legitimate access to them, generally over the phone. The major financial gain to be derived is to harvest mailing lists of individuals with a particular illness or a particular anxiety. (Imagine what Preparation H could do with a list of those with active hemorrhoids!)

Inferential knowledge is clearly a major issue. I have written a number of papers on the subject and would be glad to send them to you. {I sent my address and received a number of helpful papers.}

>1. automated masking of identifiers such as addresses and
>   telephone numbers in ... extract headers as created [at the
>   HMO]
>2. automated masking of medical record numbers
>3. automated masking of each segment of each member's name
>   everywhere these segments occur in the ... extract"

Don't count on this. Spelling and keying errors alone will leave tracks that can be followed very easily. One thing you might consider is the use of a spell checker, but the overhead this will add is almost as heavy as having some human read and edit all the records.

>
>There are some known problems with this masking (e.g., regarding the
>occurrence of names in the record other than that of the particular
>patient). My problem is that I have no idea how much faith, trust,
>etc. to put into the "automated masking" process. Of particular help
>would be guidance on what questions to ask about this process to help
>make decisions about whether it is sufficient (guidance on literature
>would also be appreciated).

Have them run the production programs on a statistically significant portion of the data and review the results by hand.

>
>Another question relates to what we should be asking about the security
>of the university computer; we have been told that the center "has
>implemented data access security by granting electronic access to [HMO]
>data only to researchers designated as members of the [HMO] project."
>However, we have been provided with NO details; again, what questions
>should we be asking and how do we interpret the responses.
>

Universities have both advantages and disadvantages when it comes to security. Lots of people are trying to break in, but that also trains the sysadmins on what the problems are.

>I should mention that our committee very strongly opposes any movement
>of HMO data outside the HMO, but in rare circumstances we have agreed
>when we were satisfied with the security situation (usually a
>stand-alone computer in a room that could easily be locked).
>

I have to agree here. That would be the only 'real' security you could count on. Be sure that there isn't an internet or dial-up connection to that machine.
Might as well have it in a public area if you do.

> especially with respect to Census information. However, I am not familiar
> with recent literature on this question or with computer algorithms; further,
> I am not aware of any literature dealing specifically with this question for
> medical records (except that I do have a copy of the 9/93 publication from the
>

Fellegi and Sunter. A theory for record linkage, JASA 64, 1183-1210, 1969.
Jaro. Advances in Record-Linkage Methodology... JASA 84, 414-420, 1989.

| The process includes providing the university with example records (size of
| sample not known), where the records have been 'sanitized'. "The sanitization
| process has three stages:
|
| 1. automated masking of identifiers such as addresses and
|    telephone numbers in ... extract headers as created [at the HMO]
| 2. automated masking of medical record numbers
| 3. automated masking of each segment of each member's name
|    everywhere these segments occur in the ... extract"
|
| There are some known problems with this masking (e.g., regarding the
| occurrence of names in the record other than that of the particular patient).
| My problem is that I have no idea how much faith, trust, etc. to put into the
| "automated masking" process. Of particular help would be guidance on what
| questions to ask about this process to help make decisions about whether it is
| sufficient (guidance on literature would also be appreciated).

Also occurrence of
 - the patient's name in fields other than the field one is masking.
 - care giver's name (MD/RN/OT/PT etc) in reports
 - other personal info (eg. phone numbers to call) in report.

| I note also that the people on the project appear to be unaware of the
| possibility of identifying patients via combinations of coded information. As
| a statistician, I am aware of some of the large literature on this question,
| especially with respect to Census information. However, I am not familiar
| with recent literature on this question or with computer algorithms; further,
| I am not aware of any literature dealing specifically with this question for
| medical records (except that I do have a copy of the 9/93 publication from the
| Office of Technology Assessment entitled _Protecting Privacy in Computerized
| Medical Information_; however, this is not a technical publication).

This takes some luck, good insight, and leg work. Not sure this is what the university is after, or that it has time to verify everything.

| Another question relates to what we should be asking about the security of the
| university computer; we have been told that the center "has implemented data
| access security by granting electronic access to [HMO] data only to
| researchers designated as members of the [HMO] project." However, we have
| been provided with NO details; again, what questions should we be asking and
| how do we interpret the responses.
|
| I should mention that our committee very strongly opposes any movement of HMO
| data outside the HMO, but in rare circumstances we have agreed when we were
| satisfied with the security situation (usually a stand-alone computer in a
| room that could easily be locked).

The moment you let someone else have a copy of any portion of the data base, you lose total control of that data. Some student or bystander will get access. Or a professor takes it home over the weekend... IMHO, processes like this just don't work because they require that *everyone* be trustworthy.

| Any help or advice would be greatly appreciated and should, preferably, be
| sent directly to me at "richgold@netcom.com". If desired, I could post a
| summary of the resulting responses to this group.

Please do - I'm curious what kind of systems people use to share sensitive data and how they protect it.

IMHO, I would only let the University install stand alone HW to access your systems and set up queries to generate statistics.
No external network or removable media. If data needs to migrate back to the university, set up an operation to verify the data is cumulative in nature and does not contain any personal info. You can control who accesses the data and how.

: The HMO has entered into an agreement with a 'local'
: university (about 90 miles away) to attempt to develop tools for
: exploiting clinical text data (e.g., access, search, extract,
: manipulate the text portion of the record).

: The process includes providing the university with example records
: (size of sample not known), where the records have been 'sanitized'.

In my experience, it is impossible to sanitize databases, for just the reasons you mention, and also because someone on the project may recognize a specific case. You are also correct to be skeptical of university security. For this reason, a different procedure should be followed for testing.

1) Write a program to generate artificial records. This takes about the same amount of thought as "sanitizing" the database. It's not trivial, but not overwhelmingly difficult.

2) Provide the University group with the artificial records for testing. When the University is satisfied with the results, let them provide you with a test release of the software (or whatever portion they are working on). HMO personnel then test the software on live data, at the HMO. This may require a loan or rental of hardware.

3) If problems are found (likely), the artificial record generator may need to be modified to create records of the problem type.

This method of testing has another advantage besides protection of patient records: by creating a random selection of records with characteristics of real records, you can create a more diverse database and catch more problems. And, of course, you can make the database as large as you want, include dates later than 12/31/99, >100-year old patients, patients with large numbers of visits or diagnoses, and in general stress the system.
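
As a rough illustration of step 1, an artificial record generator might look like the Python sketch below. Everything in it is an assumption made up for the example: the record layout, the field names, the tiny vocabularies, and the function names make_record and make_records. It is not based on the HMO's real record format. The "stress" records deliberately cover the cases mentioned above: visit dates after 12/31/99, patients older than 100, and large numbers of visits.

```python
import random
from datetime import date, timedelta

# Made-up vocabularies; no value here comes from real patient data.
FIRST_NAMES = ["Alex", "Jordan", "Casey", "Morgan"]
LAST_NAMES = ["Example", "Sample", "Testcase", "Placeholder"]
DIAGNOSES = ["hypertension", "asthma", "sprained ankle", "influenza"]


def make_record(rng, record_id, stress=False):
    """Build one fictitious record (hypothetical layout)."""
    if stress:
        age = rng.randint(101, 110)       # patients older than 100
        n_visits = rng.randint(50, 200)   # unusually many visits
        first_day = date(2000, 1, 1)      # all dates after 12/31/99
    else:
        age = rng.randint(0, 99)
        n_visits = rng.randint(1, 10)
        first_day = date(1990, 1, 1)
    visits = []
    for _ in range(n_visits):
        day = first_day + timedelta(days=rng.randint(0, 3650))
        dx = rng.choice(DIAGNOSES)
        visits.append({
            "date": day.isoformat(),
            "diagnosis": dx,
            "note": f"Patient seen for {dx}.",
        })
    return {
        "record_id": f"FAKE-{record_id:06d}",
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "age": age,
        "visits": visits,
    }


def make_records(n, stress_fraction=0.1, seed=0):
    """Generate n records; a fixed seed makes the set reproducible."""
    rng = random.Random(seed)
    return [
        make_record(rng, i, stress=rng.random() < stress_fraction)
        for i in range(n)
    ]


if __name__ == "__main__":
    for rec in make_records(3, seed=42):
        print(rec["record_id"], rec["name"], rec["age"], len(rec["visits"]))
```

A fixed seed means a failing test set can be regenerated exactly, and when a problem turns up on live data at the HMO (step 3), a new branch can be added to make_record to produce records of the problem type.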
{While no decision has been made yet, our Human Studies Committee (IRB) decided to ask a whole range of questions, largely based on the material above. I also note that our state turns out to be one of the many states with a law requiring that informed consent be obtained from patients before any data with identifiable information is given to anyone outside the medical provider group; it is not clear what the criteria are for deciding whether data contains identifiable information.

Again, thank you all very much.

Rich Goldstein}

------------------------------

End of Computer Privacy Digest V5 #047
******************************