I hope this article finds you all safe and healthy. I know that many of us are doing our best to get through this strange time, but the bright side is that new opportunities have presented themselves. I have been wanting to look into Python programming for some time now, and this little hiatus we all find ourselves in has afforded me the opportunity to take the plunge. While going through some of the initial exercises on machine learning in Python, I was struck by something that made me smile. Before I get to that, however, let me back up a bit and touch on something I have been focused on over the last two years of my career: data quality.
I have been working in data for a long time. There is no reason to date myself here, but let’s just say that my first data role was working on an Informix database used by a call center system. As such, I have had some experience with data over the years. I was an ETL developer for many years, and I settled into business intelligence for the vast majority of my career. I’ve even written a book on BI, which you can find on Amazon, called Cooking with Business Intelligence. Don’t worry, I am not going to give you the entire resume. There is a reason I am providing this background, though. I want you to understand that I have been a part of every aspect of the data lifecycle. From developer to operational manager to strategic executive sponsor, I have done it all. Business Intelligence and Advanced Analytics have been extremely fun and fulfilling, but I always find that my true love is data governance and quality. I know that sounds lame, and you’re probably thinking, “How boring can you be?” The answer is: very!
The reality is that were it not for foundational efforts in data quality and governance, there would be no insight from machine learning and AI initiatives. Take a look at my other articles for more in-depth rants on these topics if you like that kind of stuff. Here I want to discuss the specifics of data governance and quality as they pertain to programming in Python and other languages used for machine learning, like R. Getting back to my experience with the language, I found myself spending a good deal of time in the data preprocessing phases of learning Python. At first, I was asking myself when we would get to the fun predictive elements, and then I realized just how important these steps were. Unfortunately, they took time. This is time I imagine many data scientists don’t want to spend ensuring data is fit for purpose. For instance, when working with my initial dataset, there were a couple of rows whose features (essentially data columns that represent independent variables) had no values in them.
The first thought here might be just to dump the row, but how dangerous might that be when the row holds a critical data element needed to make an accurate (or more accurate) prediction? Should we develop a habit of trashing data when it doesn’t quite fit our model? Sounds a little like manipulation to me. The Python answer to this is to clean up the data using various techniques, such as the following one I used in my first Python exercise:
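For illustration, here is a minimal sketch of this kind of mean imputation. It is not necessarily the exact code from that exercise: the `ages` column and the helper name `fill_missing_with_mean` are hypothetical, and it uses only the standard library. Libraries such as pandas and scikit-learn offer their own ways to do this.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Return a copy of `values` with each None replaced by the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        # With nothing to average, there is no sensible substitute value.
        raise ValueError("cannot impute a column with no known values")
    column_mean = mean(known)
    return [column_mean if v is None else v for v in values]


# Hypothetical "age" feature with two missing entries (None).
ages = [44, 27, None, 38, 35, None]
filled_ages = fill_missing_with_mean(ages)

# Both gaps are filled with 36, the mean of the four known ages.
print(filled_ages)
```

Note that the original list is left untouched, so the imputed column can always be compared with what was actually captured.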
I won’t bore you with the details of what all of this means and most of you are probably laughing at my basic comprehension of Python anyway. The important thing to know about the code above is that it takes the mean of the column with the missing value and uses that average to fill in the blank. It’s a good substitute but a substitute nonetheless. Remember in school when your favorite teacher was out sick and a sub came in to cover for her? They were okay but that was a long day and we were happy to have Mrs. Tyler back in the classroom the next day. This is like that. It’s a good second choice, but it’s not our first choice. We would rather have that missing value.
Now, sometimes that value cannot be recovered. It may be missing for a legitimate reason: it was not provided, or it never existed in the first place. Unfortunately, my years in the data space have taught me that this is not typically the reason for our missing data. The primary reason for the error is that we have no mechanism in place to ensure data is captured appropriately and that the right people are alerted when data has not been captured. Were those mechanisms in place, we might be able to go back to the point of data entry as soon as the discrepancy is found and make the correction at the source, which is exactly where the problem should be corrected.
You may be asking if all of this work up front is worth the effort. I understand the question, because these other quality and governance resources can be expensive. Think of this, though: the results of bad data in a machine learning algorithm can be as stark and impactful as the following example.
Figure 1 - Well-Fitted Regression Slope
Figure 2 - Poorly-Fitted Regression Slope
The first diagram shows a well-fitted regression line; the second shows one that is not so well fitted. The farther along your data journey that second example takes you, the worse your predictions will be and the more your customers will be asking how well you really know them.
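To make the effect concrete, here is a small sketch with made-up numbers. It is not the data behind the figures above: it fits an ordinary least-squares line to five clean points and then to the same points with one mis-keyed value. The function name `least_squares_line` and both datasets are assumptions for illustration.

```python
def least_squares_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
clean_ys = [2, 4, 6, 8, 10]   # hypothetical clean data following y = 2x
dirty_ys = [2, 4, 6, 8, 100]  # same data with the last value mis-keyed

clean_slope, clean_intercept = least_squares_line(xs, clean_ys)
dirty_slope, dirty_intercept = least_squares_line(xs, dirty_ys)

print(clean_slope, clean_intercept)
print(dirty_slope, dirty_intercept)
```

In this toy case a single bad record out of five moves the fitted slope from 2 to 20, which is the kind of distortion that every downstream prediction then inherits.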
We need to address these data issues up front and leave the heavy analytics magic to those very expensive data scientists who never asked a teacher, “When am I going to use algebra?” We can do this by implementing a few basic policies and procedures that we would ask both our data development teams and business stakeholders to adopt in the name of accuracy, consistency, timeliness and confidence.
The first thing to consider is building checks into your ETL (extract, transform, load) programs that look for data discrepancies and alert the appropriate owners when one is found. ABC (Audit, Balance and Control) processes are one way to ensure you are proactively seeking out and correcting data issues at the source. The process is simple:
Audit – I audit the data coming from the source to ensure it conforms to a data quality rule. For example, I always expect the social security number field to contain numbers and no letters.
Balance – If I ask the source for 100 records, and a dollar-value field in those records adds up to $1,000 when I sum every record, then both the record count and that total should remain the same when the data arrives at the target.
Control – Finally, if I detect that data I received has broken a rule, I can reject it or capture it in an exception report and ask the source owner to investigate. They would correct the data in the source, and it would come back over in the next ETL processing run.
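The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production ETL framework: the record layout (`id`, `ssn`, `amount`), the nine-digit SSN rule, and the function names `audit`, `balance` and `control` are all hypothetical.

```python
import re

# Assumed quality rule: an SSN is exactly nine digits, with no letters or dashes.
SSN_RULE = re.compile(r"[0-9]{9}")


def audit(records):
    """Audit: split records into those that satisfy the SSN rule and those that break it."""
    passed, exceptions = [], []
    for record in records:
        if SSN_RULE.fullmatch(record["ssn"]):
            passed.append(record)
        else:
            exceptions.append(record)
    return passed, exceptions


def balance(records, expected_count, expected_total):
    """Balance: the row count and summed amount must match the totals reported by the source."""
    actual_total = sum(record["amount"] for record in records)
    return len(records) == expected_count and actual_total == expected_total


def control(exceptions):
    """Control: build an exception report for the source owner to investigate."""
    return [f"record {record['id']}: ssn field breaks the digits-only rule"
            for record in exceptions]


# Hypothetical batch as it arrived at the target; the source reported 3 rows totaling 1000.
arrived = [
    {"id": 1, "ssn": "123456789", "amount": 400},
    {"id": 2, "ssn": "12345678X", "amount": 350},
    {"id": 3, "ssn": "987654321", "amount": 250},
]

passed, exceptions = audit(arrived)
is_balanced = balance(arrived, expected_count=3, expected_total=1000)
report = control(exceptions)

print(is_balanced)
print(report)
```

Here the batch balances, but record 2 fails the audit and lands in the exception report rather than being silently loaded or quietly dropped.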
This is a technical solution to our data quality issue. Is technology the biggest challenge in this scenario, though? Unfortunately not. I say unfortunately because technology is not difficult to figure out. Spend enough time on a technology tool or language and you will eventually get it handled. The same cannot be said for cultures. Our main concern with data quality and governance is culture. Cultures are complex, varied and, most of all, full of personalities. Culture is the hardest thing to manage in this space because it introduces so many variables that simply introducing a data process can’t overcome them. So what do we do about culture?
The best way to address quality issues from a culture perspective is education. Typically, when someone does not want to adopt a new policy or procedure it is because they do not understand it. You just stumbled into the data governance section of the article. We cannot simply cram data governance policy down people’s throats here. Just ask my friend and fellow author Bob Seiner. Bob describes non-invasive data governance in the following way:
"I define Non-Invasive Data Governance™ as “the practice of applying formal accountability & behavior to assure quality, effective use, compliance, security and protection of data.” Non-Invasive describes how governance is applied to assure non-threatening management of valuable data assets. The goal is to be transparent, supportive and collaborative."
Bob’s book, “Non-Invasive Data Governance,” can be found on his long-running data management site, TDAN.com. It’s important to bring our business partners along for the ride and, as Bob states, to be “collaborative” and “supportive” in that process while being “non-threatening”. This means that we educate our business partners and work with them to create policies that mandate the use of quality processes to ensure that data is fit for use.
There is a lot to the data quality space, but the payoff is worth the extra investment and effort. Believe me, your data scientists will thank you for the clean and tidy data, and your business stakeholders will be amazed at the speed with which they get results from the advanced analytics processes.
That’s it this time around. Remember to stay safe and appreciate this time we have with family.