At first glance, the quest for knowledge seems fairly straightforward: collect data when required, and the more data the better. Unfortunately, it is not so easy. Turning data into knowledge is predicated on the assumption that the data you are analyzing is “good data.” Otherwise, the data is not reliable, and the analysis and resulting insights are fundamentally flawed.
The ugly truth of big data is that it is not dependable, and that lack of dependability has major consequences for every analysis and decision built on top of it.
So, what does good data look like? Before we dive into the gold standard, let’s identify the hallmarks of “bad data.” Keep in mind, you may think your data is in fact “good,” but if any one of the following indicators applies to you, there’s room for dramatic improvement.
How much time do you spend cleaning data? I’m jumping to conclusions here, but I’m confident that “how much” is more appropriate than “do you” when it comes to cleaning data. Whether your data was recorded incorrectly, isn’t intelligible, or is missing fields entirely, cleaning it is a painful process that consumes many hours and is itself prone to error. Dirty data is an indicator of a poor data collection tool.
Do you have to repeat data collection? Sometimes errors in data are minor and can be fixed quickly. Other times they require bringing in other team members, revisiting a site, or even starting over completely. Worse still is when data is lost and cannot be recovered or re-collected: any estimate you substitute is unreliable, yet indistinguishable from the legitimate records around it.
Do you spend time converting data? Data conversion can take many forms, all of which are time consuming. Transferring paper records into a database, matching pictures with their appropriate inspections, and scoring records from a matrix of lookup values are all varieties of data conversion. The need to transpose or transform your data is a sign of immature data: you’ve collected what is necessary, but it is not in its final form.
Now, what about good data? I use three simple guidelines to maximize the collection of “good data.”
Follow the principle of tightest fit. Say you’re registering members electronically for your new fan club and need to collect their mailing address. How do you capture the state? Free text opens up a world of failure: misspellings, inconsistent or incorrect abbreviations, and stray capitalization are all knocking on the door. I recommend a checklist or pre-defined set of responses whenever possible; both your data integrity and your speed of capture will improve. On a digital platform, validation rules are also recommended so that open-ended data types such as numbers are appropriately bounded. Remember, the goal is to eliminate the ability to capture bad data.
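To make this concrete, here is a minimal sketch in Python of what “tightest fit” validation might look like for our fan club form. The field names, the abbreviated state list, and the age field are hypothetical examples, not tied to any particular form tool.

```python
# A minimal sketch of "tightest fit" validation. The state list is
# abbreviated for brevity; a real form would list all 50, and the age
# field is a hypothetical example of an open-ended numeric type.

US_STATES = {"AL", "AK", "AZ", "CA", "CO", "NY", "TX"}

def validate_state(value: str) -> str:
    """Accept only a value from the pre-defined set of responses."""
    state = value.strip().upper()          # normalize capitalization up front
    if state not in US_STATES:
        raise ValueError(f"Unknown state code: {value!r}")
    return state

def validate_age(value: str) -> int:
    """Bound an open-ended numeric field with an explicit rule."""
    age = int(value)                       # raises ValueError on non-numeric input
    if not 0 < age < 120:
        raise ValueError(f"Age out of range: {age}")
    return age

if __name__ == "__main__":
    print(validate_state("ca"))            # -> "CA": normalization rescues casing
    print(validate_age("34"))              # -> 34
    try:
        validate_state("Calfornia")        # free text would have let this through
    except ValueError as err:
        print(err)                         # bad data is rejected at the point of entry
```

The design choice to notice: an invalid value is rejected at capture time, so there is nothing to clean later.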
Minimize human interaction and decision-making. What if you didn’t need to fill in city or state in our fan club registration? What if you didn’t have to “skip to section 8”? Lookup tables, automatic calculations, and skip logic should be implemented whenever possible to minimize the data collector’s influence. Next time you go grocery shopping, imagine if there were no barcode scanner: manual entry of each item would be slower and more prone to error.
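As an illustration, the sketch below shows one way a lookup table and skip logic might remove decisions from the data collector. The ZIP-code entries and section names are hypothetical placeholders, not a real form definition.

```python
# A sketch of lookup tables and skip logic removing human decisions.
# ZIP entries and section names are hypothetical placeholders.

ZIP_LOOKUP = {
    "10001": ("New York", "NY"),
    "94103": ("San Francisco", "CA"),
}

def autofill_address(zip_code: str) -> dict:
    """Derive city and state from the ZIP code instead of asking for them."""
    city, state = ZIP_LOOKUP[zip_code]     # a missing ZIP fails loudly here
    return {"zip": zip_code, "city": city, "state": state}

def next_section(current: str, is_member: bool) -> str:
    """Skip logic: route non-members straight to section 8 automatically."""
    if current == "section_3" and not is_member:
        return "section_8"                 # no "skip to section 8" instruction to misread
    return {"section_3": "section_4", "section_4": "section_5"}.get(current, "done")

if __name__ == "__main__":
    print(autofill_address("94103"))        # city and state filled with zero keystrokes
    print(next_section("section_3", False)) # -> "section_8"
```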
Avoid becoming a system of systems. Good data should open possibilities for systems, not require systems to support it. A successful data solution keeps the number of data stores, handoffs, and conversions to a minimum. You do not want to waste time copying records into a database, manually associating multimedia with records, and sending notifications by hand. These tasks should be automated or eliminated completely.
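As a rough illustration, a single entry point can persist the record, associate its photos, and fire the notification in one step. The save_record() and notify() functions here are hypothetical stand-ins for whatever storage and messaging your platform provides; the point is that no human copies data between systems.

```python
# An illustrative sketch of automating the handoffs. save_record() and
# notify() are hypothetical stand-ins; the point is the single entry point.

import json
from pathlib import Path

def notify(message: str) -> None:
    """Stand-in for an email or webhook call in a real pipeline."""
    print(f"[notify] {message}")

def save_record(record: dict, photos: list, store: Path) -> None:
    """One call persists the record, links its photos, and fires the notification."""
    record["photos"] = photos                        # association happens automatically
    path = store / f"{record['id']}.json"
    path.write_text(json.dumps(record, indent=2))    # one data store, one write
    notify(f"Record {record['id']} saved with {len(photos)} photo(s)")

if __name__ == "__main__":
    save_record({"id": "r-001", "note": "site inspection"},
                ["front.jpg", "rear.jpg"], Path("."))
```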
We’ve outlined how to identify bad data, why you cannot afford to be a statistic, and guiding principles to ensure your data works for you. So what’s the next step? If you’ve read this far, you’re on the right track. Getting more informed and discussing options with somebody who is not biased towards the “as-is” will be a critical first step towards the “could-be.” Find a strategic partner who will not allow you to compromise on elements of bad data, and reap the long-term rewards of your vigilance.