Deidentification

Deidentification of data #

The team and I do a lot of work with patient experience data, and it can be tricky to reuse without thorough redaction because people put personal information in there, sometimes asking for a response by email or telephone. It’s a bit of an unusual case because the data should be anonymous: there’s no need to identify patients in it at all, I just want the text of their experience. In practice, though, it isn’t.

Having some way of automatically redacting patient experience data to a high enough standard to reuse the data would be very useful, and to that end I made a pull request on the NHSX internship site to see if they could look at it with a PhD student.

Jonny Pearson very helpfully gave me some links to resources to improve it, and I need to digest their contents a bit, so I may as well do it here in case it helps anyone else (or future me, for that matter 😄). We have:

You may just want the links and ignore my witterings below of course 😉.

Introduction to anonymisation #

At a simple level, organisations that are trying to anonymise data will mask direct identifiers. Some data, like diagnosis or medication, is not directly identifying. Some data, like name or phone number, is identifying, and these variables are known as direct identifiers. Masking the direct identifiers should in theory make it impossible to identify individuals, but Sweeney showed that combining quasi identifiers could allow individuals to be identified. Quasi identifiers are values that are not directly identifying on their own, but may be identifying in combination. Sweeney suggested that this kind of linkage attack can be prevented by generalising some of the data, for example by using year and month of birth rather than exact birth date.
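
To make the two ideas concrete, here is a minimal Python sketch of masking direct identifiers in free text and generalising a birth date. The regular expressions and the function names (`mask_direct_identifiers`, `generalise_birth_date`) are my own illustrative assumptions, not a vetted redaction tool. They will miss plenty of real-world formats and should not be relied on for actual patient data.

```python
import re
from datetime import date

# Illustrative patterns only (assumptions): real emails and phone numbers
# come in many more shapes than these cover.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+44\s?|0)\d(?:[\s-]?\d){8,9}")  # rough UK-style numbers


def mask_direct_identifiers(text: str) -> str:
    """Replace anything that looks like an email or phone number with a tag."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def generalise_birth_date(born: date) -> str:
    """Keep only year and month, dropping the day of birth."""
    return f"{born.year:04d}-{born.month:02d}"


comment = "Great care. Please email jo@example.com or ring 07700 900123."
print(mask_direct_identifiers(comment))
print(generalise_birth_date(date(1985, 3, 17)))
```

Masking removes the value altogether, whereas generalising keeps a coarser version of it, which is why generalisation is the usual tool for quasi identifiers that are still needed for analysis.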

Sweeney codified these ideas within a concept called k-anonymity. The k variable gives the minimum number of people who are indistinguishable from each other in the dataset. A k-anonymous dataset means that each row is indistinguishable from at least k-1 other rows. For example, when k is 5 each individual is indistinguishable from at least 4 other individuals.
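
As a rough sketch of how k could be measured, the Python below groups rows by their quasi identifier values and reports the size of the smallest group. The function name `k_anonymity`, the column names and the toy rows are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values for every quasi identifier (0 for an empty dataset)."""
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0


# Toy data (made up): birth date already generalised to year and month.
rows = [
    {"birth": "1985-03", "postcode_area": "NG7"},
    {"birth": "1985-03", "postcode_area": "NG7"},
    {"birth": "1990-11", "postcode_area": "NG5"},
    {"birth": "1990-11", "postcode_area": "NG5"},
    {"birth": "1990-11", "postcode_area": "NG5"},
]

print(k_anonymity(rows, ["birth", "postcode_area"]))  # 2
```

Here the smallest group has two rows, so the toy dataset is 2-anonymous: every row matches at least one other row on the chosen quasi identifiers. A single row with a unique combination would pull k down to 1.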

The modern big data landscape #

In more recent times the data available about each individual has proliferated and additional controls are often required in order to thoroughly anonymise data. Controlling the environment can help with this, for example by having data accessed in a secure area where data access is monitored, or providing data in a secure way that doesn’t allow it to be recombined with other data, like a trusted research environment.

Environmental controls do not generally affect the utility of data for analysis, but they carry costs of their own, such as the expense of running the environment and restrictions on how the data can be used.

Summary #

  • Deidentification comes at a cost, either in terms of the usefulness of the data or the cost of managing environmental access to data
  • Deidentification can be very difficult when there is high dimensional or high frequency data
  • It is not always obvious what will and what won’t lead to reidentification, and there are many high profile failures of deidentified data (e.g. Netflix)
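
The point about high dimensional data can be illustrated with a small Python sketch: as more columns are treated as quasi identifiers, a larger share of rows becomes unique and therefore easier to single out. The helper `unique_fraction`, the column names and the toy rows are made up for illustration.

```python
from collections import Counter


def unique_fraction(rows, columns):
    """Fraction of rows whose combination of values in `columns` is unique."""
    groups = Counter(tuple(row[c] for c in columns) for row in rows)
    unique_rows = sum(1 for size in groups.values() if size == 1)
    return unique_rows / len(rows)


# Toy data (made up for this example).
rows = [
    {"birth_year": 1985, "sex": "F", "postcode_area": "NG7", "visit_month": "2021-01"},
    {"birth_year": 1985, "sex": "F", "postcode_area": "NG7", "visit_month": "2021-02"},
    {"birth_year": 1985, "sex": "F", "postcode_area": "NG5", "visit_month": "2021-01"},
    {"birth_year": 1985, "sex": "M", "postcode_area": "NG5", "visit_month": "2021-01"},
]

columns = ["birth_year", "sex", "postcode_area", "visit_month"]
for n in range(1, len(columns) + 1):
    # Add one more column each time and see how many rows become unique.
    print(columns[:n], unique_fraction(rows, columns[:n]))
```

With one column no row is unique. With all four, every row is. Real datasets behave the same way, just with many more columns to combine.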

Aggregation #