Data Quality, or “Garbage In, Garbage Out”

Another week, another series of massive data breaches.  Of the few we heard about in the last seven days, none is as concerning as Facebook’s breach involving more than 50 million user accounts.  Those are the kinds of numbers that, depending upon how the breach occurred, could draw a massive penalty from the Data Protection Commissioner in Ireland, which has already announced that an investigation is underway.  And in response to the story, FTC Commissioner Rohit Chopra issued a laconic but ominous tweet stating, “I want answers.”  Yikes.

In the clamor surrounding a breach, we naturally focus on the number of files accessed — look at any one of the headlines surrounding the Facebook leak and you’ll see “50 million” repeated over and over again.  It’s the same for every breach – you may not remember how Yahoo was compromised, but you probably remember that all three billion users’ accounts were exposed.  We focus on the numbers because they’re easier to grasp than, say, the intricacies of malware or the miscommunications between departments that left gaps in security architecture.  We also associate volume with quality and prowess:  “Wow, Facebook had 50 million records hacked, what an incredible amount of information they have!”  To many observers, a big number of files means a high-quality data intake operation.

Except, not really.

There is no natural link between the number of pieces of data you have and the quality of that data.  In fact, much of the data Facebook has is necessarily “bad,” because it hasn’t been updated or it was never good to begin with.  It’s not good that this information was improperly accessed, but it won’t necessarily lead to harm for consumers.  So amid those 50 million Facebook files are countless pieces of worthless information — anything from an outdated profile picture to a cancelled credit card number to a GRU-controlled spam account.

At the same time, that information wasn’t doing much to help Facebook, either.  Indeed, companies spend millions of dollars per year dealing with the consequences of bad internal data; Gartner research suggests that the average yearly cost to an American company is nearly $10 million.  In some ways, low-quality data is actually worse than no data at all, because in the absence of data, you at least know that you need to search for information and grow your datasets.  When you have bad data and you don’t know that it’s bad, you’re in the worst of all worlds.  It’s the opposite of a Rumsfeld-style “known unknown” — in this situation, you know you have data, but you don’t know that it’s bad.

Low-quality data poses problems from a regulatory perspective as well.  The GDPR enshrines the right of data subjects to request the rectification of the data you possess about them.  This right implicates both your ability to process data subject access requests (which are almost always the first step in rectification) and your ability to meaningfully identify how, when, and where you can correct defects in the data you have.  A slow, half-hearted, or non-existent rectification regime can cause serious problems with regulators, who have every incentive to treat the correction of data as seriously as its outright deletion.  And rectification isn’t just a GDPR issue: the right to rectify is recognized in data protection regimes around the world, from South Korea to Brazil.
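To make that less abstract, here is a minimal sketch of what tracking a rectification request might look like in code.  It isn’t drawn from any particular product or from this post; the dataclass, the system names, and the fields are all hypothetical, and a real implementation would hook into each system’s own update process.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical sketch of a rectification log. The systems and fields are
# invented for illustration, not taken from any real deployment.

@dataclass
class RectificationRequest:
    subject_id: str            # identifier used across your systems
    field_name: str            # e.g. "mailing_address"
    corrected_value: str       # value the data subject says is correct
    received: date
    systems_updated: set = field(default_factory=set)

# Systems that hold personal data -- this list should come from your data inventory.
SYSTEMS_OF_RECORD = {"crm", "billing", "marketing_profiles"}

def apply_rectification(req: RectificationRequest, system: str) -> None:
    """Record that a given system has applied the correction."""
    if system not in SYSTEMS_OF_RECORD:
        raise ValueError(f"Unknown system: {system}")
    # ... call the system's own update API here ...
    req.systems_updated.add(system)

def is_complete(req: RectificationRequest) -> bool:
    """A request is only done when every system of record reflects the fix."""
    return req.systems_updated == SYSTEMS_OF_RECORD

req = RectificationRequest("subject-123", "mailing_address", "1 New St", date(2018, 10, 5))
apply_rectification(req, "crm")
print(is_complete(req))   # False until billing and marketing_profiles are updated too
```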

So what can you do?  The answer is a complicated one, because it requires short-term investment and long-term discipline.  To start, you really need to do that data inventory you’ve been pushing off for months (ahem, years), because until you have a comprehensive catalog of the information in your possession, you will not be able to confidently identify gaps, flaws, or weaknesses in quality.  You also have to understand the data flows at your company.  Are you still pulling in data from MySpace as a component of your direct marketing profile for consumers?  Are you positive that you aren’t?  Like a recurring payment you authorized in 2006 and forgot about, scrapes or paid data appends from less-than-ideal sources can pollute your dataset and leave you with suboptimal profiling tools.
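For illustration only, here is one way a lightweight data inventory and stale-feed check could look.  The source names, fields, and costs below are invented; a real catalog would be populated from your actual systems.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical data inventory entries -- the sources, owners, and costs are
# invented for illustration.

@dataclass
class DataSource:
    name: str              # e.g. "myspace_append_2006"
    owner: str             # team responsible for the feed
    fields: list[str]      # what the feed actually delivers
    last_verified: date    # when someone last confirmed the feed is accurate
    annual_cost: float     # what you pay to keep it flowing

INVENTORY = [
    DataSource("crm_contacts", "sales-ops", ["email", "name", "company"], date(2018, 9, 1), 0.0),
    DataSource("myspace_append_2006", "unknown", ["interests"], date(2009, 3, 15), 120_000.0),
]

def stale_sources(inventory: list[DataSource], max_age_days: int = 365) -> list[DataSource]:
    """Flag feeds nobody has verified recently -- the 'forgotten recurring payment' problem."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [src for src in inventory if src.last_verified < cutoff]

for src in stale_sources(INVENTORY):
    print(f"Review or cut: {src.name} (last verified {src.last_verified}, ${src.annual_cost:,.0f}/yr)")
```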

Two other strategies to implement: verification and streamlining.  The former is straightforward, in that you use known, high-quality datasets as the source of truth for information about consumers, clients, or targets.  For instance, if you’re storing a list of all of your presently-paying, email-opening, link-clicking clients (and you really, really should be), that’s a very good source of truth for verification, because you’re getting active contact from the customer, who is, in effect, self-verifying.  From that base, you can cross-check other datasets.  So, if your source of truth has information A, B, C on a customer, and another source of information has A, B, C, and D, you can be more confident that D is accurate based on the presence of the other corroborating data.
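Here is a minimal sketch of that cross-check, assuming both datasets are keyed by the same customer ID.  The field names, the example records, and the 80% threshold are all hypothetical.

```python
# Hypothetical cross-check: corroborate a secondary dataset against a trusted
# source of truth. Field names and the matching rule are illustrative only.

def corroboration_score(truth_record: dict, candidate_record: dict) -> float:
    """Fraction of the candidate's overlapping fields that match the source of truth."""
    shared = [k for k in candidate_record if k in truth_record]
    if not shared:
        return 0.0
    matches = sum(1 for k in shared if candidate_record[k] == truth_record[k])
    return matches / len(shared)

source_of_truth = {"cust_42": {"email": "a@example.com", "zip": "10001", "plan": "pro"}}
appended_data   = {"cust_42": {"email": "a@example.com", "zip": "10001", "plan": "pro",
                               "household_income": "100-150k"}}  # the unverified "D"

for cust_id, candidate in appended_data.items():
    truth = source_of_truth.get(cust_id, {})
    score = corroboration_score(truth, candidate)
    # If A, B, and C agree with the source of truth, we trust the extra field D more.
    print(f"{cust_id}: corroboration {score:.0%} -> "
          f"{'keep' if score >= 0.8 else 'quarantine'} unverified fields")
```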

Finally, streamline your datasets.  Get rid of information that you don’t need, don’t use, don’t want to pay for, or shouldn’t have.  Yes, this is the principle of data minimization by another name, and yes, it’s important to do this to comply with the GDPR and other regulations.  But it’s also just good business sense.  Why are you paying a data append company hundreds of thousands of dollars a year to provide you with 50 fields of data when you use six and need four?  Close down data inflows that don’t serve a purpose, stop using datasets you don’t need, and focus on leveraging data that actually drives growth and promotes value.
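As a sketch of what “close down the fields you don’t use” can look like in practice, here is a toy minimization pass.  The field lists and the record are invented for illustration.

```python
# Hypothetical minimization pass: keep only the fields you actually use or need.
# The field names and the record below are invented for illustration.

FIELDS_IN_USE = {"email", "first_name", "last_name", "plan"}   # fields your teams actually touch
FIELDS_REQUIRED = {"email", "plan"}                            # fields you must retain

def minimize(record: dict) -> dict:
    """Drop every field that is neither used nor required."""
    keep = FIELDS_IN_USE | FIELDS_REQUIRED
    return {k: v for k, v in record.items() if k in keep}

appended_record = {
    "email": "a@example.com", "first_name": "Ada", "last_name": "L",
    "plan": "pro", "favorite_color": "teal", "car_model": "unknown",  # plus dozens more you pay for
}

print(minimize(appended_record))   # only the fields your business actually relies on
```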

David Mamet understands data quality.
