How to Value Datasets – From “Data Leverage”

A strong approach to data requires constant attention not just to the quantity of data you take in but also to its quality. It's never enough to know what data you have; good data strategy demands a firm grasp on what that data is worth. That is not a static exercise, but an ongoing effort to reimagine how data can be deployed and combined in new ways that create new opportunities for growth. In this excerpt from Data Leverage, we outline a methodology for ascertaining the value of data so you can unlock its potential.

Where to Start?

Data asset valuation is based partly upon degrees of coverage, depth, freshness, uniqueness, and accuracy. Each of these can significantly influence the value of your data, and each is something you can invest in improving. For example, if your data is not comprehensive, or does not span a large enough universe of coverage, you can expand your ingestion or gathering methods to improve on that dimension. The key is that the different degrees of these qualities multiply the value of your data.
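To make the multiplier idea concrete, here is a minimal sketch of how the five degrees might be rolled into a single score. The 0-to-1 scales, the equal weighting, and the function name are illustrative assumptions for this post rather than a prescribed formula; the point is simply that a weak score on any one degree drags down the whole valuation.

```python
# Illustrative sketch only: scoring a dataset on the five degrees.
# The 0-to-1 scales and the multiplicative combination are assumptions
# for demonstration, not a formula prescribed in the book.

def dataset_value_score(coverage, depth, freshness, uniqueness, accuracy):
    """Each argument is a normalized score between 0.0 and 1.0.

    Because the degrees multiply rather than add, a weak score on any
    one dimension drags down the whole valuation.
    """
    score = 1.0
    for degree in (coverage, depth, freshness, uniqueness, accuracy):
        score *= degree
    return score

# A dataset that is strong everywhere but stale still scores poorly.
print(dataset_value_score(0.9, 0.8, 0.2, 0.9, 0.95))  # ~0.12
print(dataset_value_score(0.9, 0.8, 0.9, 0.9, 0.95))  # ~0.55
```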

Begin with coverage: the ability to offer enough data in any given vertical or specialty is the first hurdle to overcome. There are two ways to view whether data represents comprehensive coverage. The first is simply to take a statistical viewpoint based on the industry. For example, you might remember the old Crest toothpaste commercials that loved to point out that four out of five dentists agreed that brushing with Crest was good for oral hygiene. On its face, that says 80% of dentists stand behind Crest toothpaste, but the claim never specified whether the company had asked only five dentists in total. People tend to assume the ratio came from a much broader, more comprehensive study, but those early television commercials never made that clear. This gets to the heart of a comprehensive dataset. With roughly 200,000 dentists practicing in the United States, asking five of them does not yield a comprehensive analysis. At that point, you have not only a serious extrapolation problem, but also, potentially, terrible teeth.

[GIF: Rudolph the Red-Nosed Reindeer. "Don't worry, I know a guy."]

What counts as a comprehensive dataset really depends upon your use case. If your dataset creates actionable information, then it is probably comprehensive enough. But if your dataset is really about providing a base layer of information, you will need far broader statistical coverage. This differs from market to market, but 25% coverage starts to get interesting in most cases. If you are providing business location information on the United States, you need to cover about five to six million business locations to be close to the 25% mark. If, on the other hand, you are only providing data on coin laundromats in the United States, of which there are roughly 30,000, then data on about 7,500 of them will be valuable.
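As a quick sanity check on those numbers, the arithmetic is just records held divided by the estimated universe. The roughly 22 million total U.S. business locations used below is an assumption implied by the five-to-six-million figure above, not an audited count.

```python
# Back-of-the-envelope coverage check using the rough figures cited above.
def coverage_ratio(records_held, estimated_universe):
    """Share of the total universe that your dataset actually covers."""
    return records_held / estimated_universe

# ~5.5M business locations out of an assumed universe of ~22M total
print(f"{coverage_ratio(5_500_000, 22_000_000):.0%}")  # 25%

# 7,500 coin laundromats out of roughly 30,000 nationwide
print(f"{coverage_ratio(7_500, 30_000):.0%}")          # 25%
```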

Depth is the second factor of degree. Depth of data refers to the amount of quality information contained within each record. While coverage is the number of records on a given population, depth is the number of usable data fields tied to each record. For example, consider two different data files, each with 100,000 records representing consumers in the United States. In the first file, each record contains the name, address, and phone number of the consumer. In the second, each record contains the name, address, phone number, mortgage amount, owner-vs.-renter status, purchase history, propensity model scores, credit ratings, education level, and about a thousand other potential data points that marketers use in ad targeting campaigns. The first file has the same coverage as the second, but the second has a significantly higher degree of depth.

Keep in mind that while depth can always be improved, it doesn't pay to improve it at the expense of fill rates and quality. A fill rate is the share of records in a file that actually contain a value for a given field. Adding more potential fields to create the illusion of depth is a common trick data companies use to dress up their datasets. These firms will say they have 100,000 records with 100 fields of depth, but on deeper analysis you will find that the fill rates for those additional fields are very low. If you can't fill the fields, you should seriously consider whether to market them as available at all.
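As a rough illustration of how you might audit this before marketing a field, here is a small sketch that computes per-field fill rates. The record layout, the field names, and the choice of what counts as "empty" are hypothetical.

```python
# Hypothetical records: a field counts as filled only if it holds a usable value.
records = [
    {"name": "A. Smith", "phone": "312-555-0101", "mortgage_amount": 250_000},
    {"name": "B. Jones", "phone": "",             "mortgage_amount": None},
    {"name": "C. Lee",   "phone": "773-555-0199", "mortgage_amount": None},
]

def fill_rates(records):
    """Return, for each field, the share of records with a usable value."""
    fields = {field for record in records for field in record}
    empty_values = (None, "", "N/A")
    return {
        field: sum(1 for r in records if r.get(field) not in empty_values) / len(records)
        for field in fields
    }

print(fill_rates(records))
# e.g. {'name': 1.0, 'phone': 0.67, 'mortgage_amount': 0.33} (rounded)
```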

[GIF: thanks for nothing. "I think it's 'we've decided to move in another direction.'"]

Next Steps for Identifying Ongoing Value

Go on to assess the next degree, the freshness of your data. Just as in the produce section of a grocery store, there is significant value in data that is fresh, updated regularly, and not rotten. Naturally, you may not be able to control the freshness of your data; depending upon the industry, it may hinge on actions taken by others, such as the timing of purchases. Weather, traffic, and new home buyers, on the other hand, change and update every day. In some datasets, freshness is impossible to improve; in others, it improves whether you want it to or not. The key is to identify the best level of freshness possible for your dataset and to strive for that. In commercial real estate databases, this means you can only update square-footage pricing as new leases are signed, but the market will reward how quickly you can update that data as it changes.

We analyze freshness based upon the date and time stored with each record that tracks the last time the record was updated. In modern data models, however, it is important to store an updated date and time next to every field within a record. It is no longer sufficient to say you updated a consumer profile last week; your data model needs to be able to point out which field was updated last week, because data strategists have learned that a record-level timestamp is easily refreshed without any true change in the fields that carry value. Also, most models are now multivariate, meaning they can consider several variables at once and weight them by their freshness. In this way, a consumer record whose phone number was updated last week can be recognized as a lower priority than one whose physical home address was updated yesterday.
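Here is a minimal sketch of what a field-level timestamp model and a freshness-weighted score might look like. The record shape, the field weights, and the exponential decay with a 30-day half-life are all illustrative assumptions, not a prescribed model.

```python
from datetime import datetime, timezone

# Hypothetical consumer record that timestamps every field, not just the record.
record = {
    "phone": {
        "value": "312-555-0101",
        "updated_at": datetime(2024, 5, 1, tzinfo=timezone.utc),
    },
    "home_address": {
        "value": "123 Oak St",
        "updated_at": datetime(2024, 5, 20, tzinfo=timezone.utc),
    },
}

# Assumed per-field weights: an address change matters more than a phone change.
FIELD_WEIGHTS = {"phone": 0.3, "home_address": 0.7}

def freshness_score(record, as_of, half_life_days=30):
    """Weight each field by how recently it changed, decaying with age."""
    score = 0.0
    for field, payload in record.items():
        age_days = (as_of - payload["updated_at"]).days
        decay = 0.5 ** (age_days / half_life_days)  # halves every 30 days
        score += FIELD_WEIGHTS.get(field, 0.0) * decay
    return score

print(freshness_score(record, as_of=datetime(2024, 5, 21, tzinfo=timezone.utc)))
# roughly 0.87: the day-old address update dominates the score
```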

Differentiation is more of a marketing term than a definable attribute of a dataset. However, it is a critical degree multiplier for data. I've worked with several firms over the years that could demonstrate the unique nature of their data, and when that unique data was actionable, it increased the data's market value. One company, Target Data out of Chicago, was able to analyze every real estate listing and sale in the United States to extract specific attributes that allowed it to predict when a private home or residence would sell in any given market. This derived dataset is powerfully unique because of its scale and depth, but also because it included thousands of distinctive factors, such as its formula for how "granite counters" in a listing plays in a particular zip code. That said, every dataset marketed as unique will quickly find competitors seeking to mimic it. Your data collection methods, depth, coverage, and fill rates are all ways to keep your unique value proposition strong.

Accuracy is the last degree we analyze, but it is the measure by which many a data firm has thrived or perished. Data sources should include quality measures with their data, especially now that customers and data partners can check data at scale so easily. Companies can easily lose sight of how important accuracy is. When building out your data collection, storage, and delivery methods, you need to build quality checks and controls into the process itself. For example, if you take data from customers or app users, do you enforce field validation in your forms? This is the process of ensuring that a ZIP code field actually contains a ZIP code and that a consumer can't enter "puppy" into it.
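For the forms example, a field validator can be as simple as the sketch below. The pattern and function name are illustrative; a production system would typically validate against an authoritative ZIP code list or address service rather than a regex alone.

```python
import re

# U.S. ZIP codes: five digits, optionally followed by a four-digit extension.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

def validate_zip(value: str) -> bool:
    """Reject anything that is not shaped like a ZIP code."""
    return bool(ZIP_PATTERN.match(value.strip()))

print(validate_zip("60601"))       # True
print(validate_zip("60601-1234"))  # True
print(validate_zip("puppy"))       # False, as in the example above
```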

[GIF: adorable puppy. "Shameless use of puppies for self-promotion."]

Even if you enforce this sort of validation, it is highly unlikely that you enforce the same level of field validation on your own employees' access to the data. This has caused many an issue for companies, sometimes going well beyond a data quality problem. Most companies that are serious about their data strategy employ a Chief Data Officer to improve data quality across every step of collection, processing, and internal use. Companies must employ data checks, validation, and controls at all points in the creation of a dataset. Allowing employees to bulk upload, bulk change, or bulk update data without the same controls threatens accuracy. The belief that one's employees don't make mistakes is a hallmark of a poor data strategy and an invitation to error.

[GIF: Office Space. "He's an SVP now."]

The degree of value attributed to coverage, depth, freshness, differentiation, and accuracy depends on the audience. Marketers who buy data by the pound for mass direct mailings to consumers don't demand the same degree of accuracy or depth as Wall Street analysts accessing earnings-estimate files for public companies. Part of your data strategy is to value your data and identify the value the market puts on each of these factors. It is impossible to focus on all of them at once, so choose where you get the most leverage and maximize your efforts there. Create processes and best practices that protect the data from manipulation or degradation at every step of your dataset's creation.

Lastly, document your process and methods. By doing so, you can highlight your efforts as you market your dataset for sales and for partnership valuation. Data is like any other product, good, or service: purchasers and partners want to trust that the data is fit for purpose and that its quality can be maintained through an ongoing, protected process. Your next step is to present the data in a simple and elegant way.

[GIF: James Bond. "…to talk about data."]
