Everybody loves a fad. You can pinpoint someone’s generation better than carbon dating by asking them what their favorite toys and gadgets were as a kid. Tamagotchi and pogs? You were born around 1988, weren’t you? Coleco Electronic Quarterback and Garanimals? Well well, an early X-er. A fad is cultural currency and social lubricant at the same time: even if you don’t have the thing itself, it’s a shared reference point that helps locate you as part of a particular time and place. Paradoxically, fads also signal when a concept has gone stale: by the time everyone is doing it, the moment has usually passed.
Fads happen in business, too. From corporate retreats to themed attire days (back in the olden times when we went to retreats, offices or, you know, anywhere) to the more recent mandatory fun on Zoom, enterprises are no less susceptible to fads, especially when they involve technology. Part of it is a desire to seem cutting-edge, but a large part of it, we think, is simple misunderstanding. Without a good grasp of new systems and tools, or the concepts that underlie them, it’s hard to tell the difference between a fad and a future.
Case in point: anonymization. Although the concept of masking identity or erasing identifiable features has long been a component of data science, it was not a widespread topic of discussion in industry in the US until the late 2000s and, really, just before GDPR came into effect and fears of 4% penalties kicked in. Hundreds of vendors promise services that allow you to “anonymize” user data in an effort to find safe harbors or avoid liability, but most businesses have only a vague understanding of what the concept of anonymized data really is and how to do it.
To unpack anonymous data, it’s important to clear up a few terms so that we don’t run into confusion. First, what is anonymous data? It is data that does not relate to an identified or identifiable natural person, or data modified such that the data subject is not, or is no longer, identifiable.
That is an extremely vague definition for such an important concept, so let’s dig into it a little more, because this is a game of definitions (every lawyer’s favorite game). If data, on its own or combined with other data, can identify you, it’s personal data. We don’t talk about personally identifiable information anymore; that fad has passed. These days, it’s all just personal data.
There are ways to make data less useful in identifying a person, but that does not mean the data is anonymous. Instead, there are varying degrees of data obfuscation, meaning hiding attributes to make reidentification more difficult, on the way to actual anonymization. Here are the two most important kinds.
Masked Data: What it Is
Masked Data is information modified to hide (or “mask”) the underlying, true data. This is a common practice in business, and it is most effective against unauthorized internal review (and pilfering) of valuable business/customer data and against external actors learning important details about clients and vendors. A simple example of masked data is a customer list detailing first and last name, age, address, and amount spent, with surnames changed to dummy names, ages shifted, and amounts spent reallocated randomly. Much of the derivative analytic data remains the same (amounts spent, total number of customers, locations of accounts, etc.), but it is difficult to reidentify any individual user.
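The masking described above can be sketched in a few lines. This is a minimal, hypothetical illustration (all customer records and field names below are invented): names are replaced with dummies, ages are shifted, and spend is shuffled across rows, while row counts and totals survive for analytics.

```python
import random

# Hypothetical customer records; every name and figure here is invented.
customers = [
    {"name": "Ada Smith", "age": 34, "city": "Austin", "spend": 120.0},
    {"name": "Bo Jones",  "age": 51, "city": "Boston", "spend": 310.0},
    {"name": "Cy Nguyen", "age": 28, "city": "Austin", "spend": 75.0},
]

def mask(records, seed=0):
    rng = random.Random(seed)
    spends = [r["spend"] for r in records]
    rng.shuffle(spends)  # reallocate amounts spent randomly across rows
    masked = []
    for r, spend in zip(records, spends):
        masked.append({
            "name": f"Customer-{rng.randrange(10_000):04d}",  # dummy name
            "age": r["age"] + rng.randint(-2, 2),             # shift ages
            "city": r["city"],   # keep location for derivative analytics
            "spend": spend,
        })
    return masked

masked = mask(customers)
# Derivative analytics survive: total spend and customer count are unchanged,
# but no row links a real name to a real amount.
assert sum(r["spend"] for r in masked) == sum(r["spend"] for r in customers)
assert len(masked) == len(customers)
```

Note that the masking is only as good as the custody of the original list: whoever can still read `customers` holds the real data, which is exactly the problem the next section describes.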
Masked Data: What it Isn’t
Having a list where the names and identifiers are shifted is a great business approach, but it usually falls short of anonymity in the real world. Why? Because usable data is accurate data, and being able to run the kind of analytics you want means being able to easily mix and match the true underlying information. As such, keeping the master list (the non-masked data) available means you will always hold the original information, which means you’re still holding personal data, which means you’re not protected by the anonymity safe harbor. Thanks for playing.
Pseudonymous Data: What it Is
Pseudonymous data is data that has the most important identifiers removed: names, email addresses, social security numbers, etc. Pseudonymous data still identifies a person, but it isn’t obvious on its face who that person is. Think back to school, when grades were posted outside the classroom using only student numbers. In the Mad Max-style rush to the sheet to see your grades, you couldn’t match a score to anyone else’s name, so you learned only your own outcome. This is a good example of pseudonymization and of why it’s used: to protect individuals from unnecessary exposure of their personal details, including a devastatingly embarrassing failed geometry test in ninth grade.
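The grade-sheet pattern can be sketched as a keyed token mapping. This is a hypothetical illustration (the key, names, and grades are all invented): names are replaced with stable tokens, so the sheet is readable without exposing names, but anyone holding the key can recompute the link, which is why the data stays personal data.

```python
import hashlib
import hmac

# Hypothetical secret held by the "school": whoever has it can re-link
# tokens to people, so the data below is pseudonymous, not anonymous.
SECRET = b"hypothetical-rotation-key"

def pseudonym(student_name: str) -> str:
    # Replace a name with a stable token (like a student number on a
    # posted grade sheet) using a keyed hash.
    return hmac.new(SECRET, student_name.encode(), hashlib.sha256).hexdigest()[:8]

# Invented grades, posted under tokens instead of names.
grades = {pseudonym("Ada Smith"): "B+", pseudonym("Bo Jones"): "D-"}

# Ada (or the school) can recompute her token and find her grade;
# passers-by see only tokens.
assert grades[pseudonym("Ada Smith")] == "B+"
```

The design point: the tokens are deterministic under the key, so joins and lookups keep working, and the person remains identifiable to anyone who holds `SECRET`.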
The more attributes you remove from a dataset, the thinking goes, the more pseudonymized the data becomes, and the closer it gets to full anonymization, at which point you’re in the clear.
Pseudonymous Data: What it Isn’t
A panacea, or, honestly, nearly as useful as it might sound. Pseudonymization in practice is often something like this:
- We have an Excel spreadsheet with names, addresses, account numbers, customer spend, and profile data.
- We delete the customer name.
- Presto, pseudonymized data!
Of course, that might technically count as pseudonymization, but it’s virtually useless: you still have every other identifier for the individual, which means that not only is it easy to re-identify the person at issue, you haven’t meaningfully de-identified them to begin with. Think about it from a data perspective rather than a human perspective: Column A contains alphanumeric characters used to identify an individual account, and so does Column B. If they both do the same thing, what difference does it make if you delete Column A (where the alphanumeric characters are organized into what humans recognize as names) and keep Column B (where they are organized into what humans think of as an “account ID number”)? Under the law, it’s all the same, and the database/algorithm analyzing the data won’t have any problem continuing on as before the deletion.
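The Column A/Column B point can be made concrete with a toy sketch (the rows and the `account_id` field are invented for illustration): delete the name column and every lookup still works, because the account ID is just as much a per-person key as the name was.

```python
# Hypothetical rows; names and IDs are invented.
rows = [
    {"name": "Ada Smith", "account_id": "AC-9913", "spend": 120.0},
    {"name": "Bo Jones",  "account_id": "AC-2047", "spend": 310.0},
]

# "Pseudonymize" by deleting the name column (Column A)...
pseudo = [{k: v for k, v in r.items() if k != "name"} for r in rows]

# ...but account_id (Column B) is still a unique per-person key:
# any join or lookup works exactly as before the deletion.
by_id = {r["account_id"]: r for r in pseudo}
assert by_id["AC-9913"]["spend"] == 120.0

# Still exactly one row per person: nobody was actually de-identified.
assert len({r["account_id"] for r in pseudo}) == len(pseudo)
```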
“Fine!” you shout, annoyed, “why don’t we just delete names, addresses, account numbers, and credit card information and only keep the more vague data attributes!” A great idea, and it’s the thought process behind GDPR’s approach to anonymization: if you delete enough data and remove enough identifiers, eventually you’ll get to a place where you don’t have personal data anymore and the rights of natural persons are protected.
Except not really.
If you’re keeping any data at all, and especially if you’re keeping multiple data points and attributes, odds are you will end up able to reidentify an individual. A landmark study in Nature Communications reviewed a variety of “anonymized” datasets and came to a striking conclusion:
“Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.”
In other words, if you have enough data attributes, even “anonymous” data is nothing of the sort, which means that GDPR’s approach to anonymization (followed around the world) has a fatal flaw in the underlying thought process, and the Get-Out-Of-Brussels-Free Card that data companies thought would protect them is actually fairly useless.
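The mechanic behind that finding can be seen in a tiny sketch (all rows below are invented): count how many records are unique on just a handful of quasi-identifiers. Even with three attributes, most toy rows are already singled out; with 15 attributes, nearly every real record is.

```python
from collections import Counter

# Toy "anonymized" release: no names, only (birth year, ZIP, gender).
# Every row is invented for illustration.
rows = [
    ("1988", "78701", "F"),
    ("1988", "78701", "F"),  # two people share this combination
    ("1988", "78704", "M"),
    ("1961", "02134", "F"),
    ("1975", "10001", "M"),
]

counts = Counter(rows)
unique_rows = [r for r in rows if counts[r] == 1]
uniqueness = len(unique_rows) / len(rows)

# 3 of 5 rows are unique on just three attributes: anyone who knows those
# three facts about a person can single out that person's record.
assert len(unique_rows) == 3
```

Adding attributes only splits the groups further, which is why the uniqueness rate climbs toward 100% as the column count grows.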
A Newer, Better Fad
This is usually the point in our blogs where we say “the good news is that there is another option” and lay out how to approach things differently. But today, we’re actually going to suggest following an older strategy to avoid some of this anonymization difficulty.
Step 1: Get rid of all the data you don’t need to fulfill your core purposes tied to the data.
Step 2: Then, once the core purpose is fulfilled, aggregate all of the data you need to run your analytics.
Step 3: Now delete the rest of the underlying data. Yes, all of it.
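The three steps above can be sketched as a single pipeline (the records, field names, and regions below are invented): trim to the fields the core purpose needs, roll them up into aggregates, then delete the row-level data and keep only the statistics.

```python
from collections import defaultdict

# Step 1: keep only the fields needed for the core purpose.
# All records here are invented for illustration.
orders = [
    {"customer": "c1", "region": "EU", "spend": 120.0},
    {"customer": "c2", "region": "EU", "spend": 310.0},
    {"customer": "c3", "region": "US", "spend": 75.0},
]

# Step 2: aggregate into the statistics your analytics actually need.
totals = defaultdict(lambda: {"customers": 0, "spend": 0.0})
for o in orders:
    t = totals[o["region"]]
    t["customers"] += 1
    t["spend"] += o["spend"]

# Step 3: delete the underlying row-level data; keep only the aggregates.
del orders
report = dict(totals)
assert report["EU"] == {"customers": 2, "spend": 430.0}
```

Once `orders` is gone, `report` contains only group-level statistics with no row to walk back to, which is the point of the exercise.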
You may be thinking that you’ve just deleted all of the data, and you’d be right. That’s often the best answer: you can’t be held liable or responsible for data you no longer hold. Get rid of it! Aggregated data is, in our view, the only truly anonymous data out there, because it’s not possible to walk the process back and reidentify an individual from aggregated statistics.
Now, will this work for everyone and for every dataset? Of course not. Sometimes you need the data for business purposes or for regulatory reasons. But in those cases, anonymization wasn’t appropriate anyway, because you have ongoing duties to protect data based on usage. Put another way, the problem with the anonymization fad is that it encourages shortcut thinking about data: “If we pseudonymize well enough, we can just do whatever we want with the data!” Except no, you can’t, and the data protection authorities are very touchy about what qualifies as properly pseudonymous or anonymized.
Is it possible to truly anonymize data? Yes. Is it the answer to all of your data concerns? Probably not, because the most important aspect of your data is how you use it, how you learn from it, and how you leverage it to grow. Anonymized data is stripped of much of its usefulness in exchange for a flimsy sense of escaping regulatory oversight. In the end, it’s a far better plan to protect the data you want, delete the data you don’t, create anonymous data only when it fits certain limited parameters, and leave the fads to the other folks. This approach gives you more time, resources, and money, and those never go out of fashion.