
Do Data Scientists Fear for Their Jobs?

What happened in this last election, November 2016? Rather, what happened to the analysts in this last election? Just about every poll and news report prediction had Hillary Clinton leading by a comfortable margin over Donald Trump. In every election I can recall from years past, the number crunchers have been pretty accurate in their predictions, at least on who would win if not on the actual numerical results. However, this turned out not to be the case for the 2016 presidential race.

But this is not the first time this has happened. In 1936, Franklin Delano Roosevelt defeated Alfred Landon, much to the chagrin of The Literary Digest, a magazine that collected two and a half million mail-in surveys, roughly five percent of the voting population at the time. George Gallup, on the other hand, predicted a Roosevelt victory with a mere 3,000 interviews. The difference, according to the article’s author, was that Literary Digest’s mailing lists were sourced from vehicle registration records. How did this impact the results? In 1936 not everyone could afford a car, so the Literary Digest sample was not truly representative of the population. This is known as sampling bias, where the very method used to collect the data points skews the numbers collected. Gallup’s interviews, by contrast, were more in line with the voting public.
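The Literary Digest story can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the car-ownership rate (30%) and the support levels in each group (40% and 70%) are hypothetical numbers chosen for illustration, not 1936 data, and `simulate_polls` is a name invented for this example.

```python
import random


def simulate_polls(seed=0, population_size=100_000, small_sample_size=3000):
    """Compare a huge car-owners-only poll with a small random poll.

    All rates below are hypothetical and exist only to show the effect.
    """
    rng = random.Random(seed)

    # Build a population of (owns_car, votes_for_incumbent) pairs.
    population = []
    for _ in range(population_size):
        owns_car = rng.random() < 0.30            # assumed ownership rate
        support = 0.40 if owns_car else 0.70      # assumed support by group
        votes_for_incumbent = rng.random() < support
        population.append((owns_car, votes_for_incumbent))

    # The "truth": the incumbent's share across the whole population.
    true_share = sum(vote for _, vote in population) / population_size

    # Large but biased poll: only people found through car registrations.
    car_owner_votes = [vote for owns_car, vote in population if owns_car]
    biased_share = sum(car_owner_votes) / len(car_owner_votes)

    # Small but representative poll: a simple random sample.
    random_sample = rng.sample(population, small_sample_size)
    random_share = sum(vote for _, vote in random_sample) / small_sample_size

    return true_share, biased_share, random_share


if __name__ == "__main__":
    true_share, biased_share, random_share = simulate_polls()
    print(f"true share:        {true_share:.3f}")
    print(f"car-owner poll:    {biased_share:.3f}")
    print(f"random 3,000 poll: {random_share:.3f}")
```

Under these assumptions, the car-owner poll covers roughly ten times as many people as the random one, yet it lands far from the true share, while the small random sample stays close to it. Sample size does not fix a sample drawn from the wrong place.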

The article cited above also mentions Boston’s ‘Street Bump’ smartphone app “that uses the phone’s accelerometer to detect potholes… as citizens of Boston … drive around, their phones automatically notify City Hall of the need to repair the road surface.” What a great idea! Or was it? The app was only collecting data from people who a) owned a smartphone, b) were willing to download the app, and c) drove regularly. Poorer neighborhoods were pretty much left out of the equation. Again, an example of sampling bias.

The final case, and not to pick on Boston: I recently heard that data scientists analyzing Twitter feeds for positive and negative sentiment had to factor in the term “wicked” as a positive sentiment force, but only for greater Boston. Apparently, that adjective doesn’t mean what the rest of the country assumes it means.
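One way to picture that kind of regional adjustment is a word-score lexicon with per-region overrides. The sketch below is a toy, not how any particular sentiment tool works: the word lists, the scores, and the names `BASE_LEXICON`, `REGIONAL_OVERRIDES`, and `score_sentiment` are all assumptions made up for this example.

```python
# Toy sentiment lexicon: +1 for positive words, -1 for negative words.
# The entries and scores are hypothetical.
BASE_LEXICON = {
    "great": 1,
    "love": 1,
    "awful": -1,
    "terrible": -1,
    "wicked": -1,   # negative by default
}

# Region-specific corrections layered on top of the base lexicon.
REGIONAL_OVERRIDES = {
    "boston": {"wicked": 1},   # treated as positive in greater Boston
}


def score_sentiment(text, region=None):
    """Sum word scores, applying any overrides for the given region."""
    lexicon = {**BASE_LEXICON, **REGIONAL_OVERRIDES.get(region, {})}
    words = (word.strip(".,!?") for word in text.lower().split())
    return sum(lexicon.get(word, 0) for word in words)
```

With this setup, `score_sentiment("That game was wicked!")` comes out negative, while the same text scored with `region="boston"` comes out positive. The text has not changed; only the assumption about who wrote it has.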

Along with sampling bias, another driving factor in erroneous conclusions from analyzing data is the ‘undocumented confounder.’ Suppose, for example, you wanted to see which coffee people prefer: Starbucks or Dunkin’ Donuts. For this ‘experiment’, we’re interested only in the coffee itself, nothing else. So we have each shop prepare several pots with varying additions like ‘cream only’, ‘light and sweet’, ‘black no sugar’, etc. We then take these to a neutral location and do a side-by-side blind taste comparison. From our taste results we draw some conclusions as to which coffee the sample population prefers. But unbeknownst to us, when the individual shops prepared their various samples of coffee, one shop used brown sugar and one used white sugar, or one used half-and-half while the other used heavy cream. The cream and sugar are now both undocumented confounders of the experiment, possibly driving results one way or the other.
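The coffee experiment can be mimicked in a few lines. In this sketch the two coffees are identical by construction, and the only difference is the sweetener. Every number here (the quality score, the size of the brown-sugar bonus, the rating noise) is a hypothetical value picked for illustration, and `taste_test` is a name invented for this example.

```python
import random
from statistics import mean


def taste_test(seed=0, tasters=2000, control_sweetener=False):
    """Return mean rating of shop A minus mean rating of shop B.

    Both coffees have the same underlying quality; only the sweetener
    differs unless control_sweetener is True.
    """
    rng = random.Random(seed)

    coffee_quality = {"shop_a": 6.0, "shop_b": 6.0}    # identical coffee
    sweetener_bonus = {"brown": 0.8, "white": 0.0}     # assumed effect
    sweetener_used = {"shop_a": "brown", "shop_b": "white"}
    if control_sweetener:
        # Hold the confounder fixed: both shops use the same sugar.
        sweetener_used = {"shop_a": "white", "shop_b": "white"}

    ratings = {shop: [] for shop in coffee_quality}
    for _ in range(tasters):
        for shop in ratings:
            noise = rng.gauss(0, 1)                    # taster-to-taster variation
            bonus = sweetener_bonus[sweetener_used[shop]]
            ratings[shop].append(coffee_quality[shop] + bonus + noise)

    return mean(ratings["shop_a"]) - mean(ratings["shop_b"])


if __name__ == "__main__":
    print(f"gap, sweetener uncontrolled: {taste_test():.2f}")
    print(f"gap, sweetener controlled:   {taste_test(control_sweetener=True):.2f}")
```

With the sweetener left uncontrolled, shop A appears to have the better coffee even though the coffees are the same. Once both shops use the same sugar, the gap shrinks to roughly zero, which is the result the experiment was supposed to measure in the first place.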

So, back to the elections, how did this year’s political analysts miss the mark? Without knowing their sampling methods, I’m willing to suggest that some form of sampling bias or confounder may have played a part. Was it the well-known ‘cell-only problem’ again (households with no landline are less likely to be reached by pollsters)? Did they take into consideration that Trump used Twitter as a means to deliver sound-bite-like messages to his followers, bypassing the mainstream media’s content filters? Some other factor perhaps as yet unidentified? As technology advances and societal trends morph over time, so must political polling and data analysis methods.

Pollsters and data scientists are continually refining their methods of collection, compensation factors and models to eliminate any form of sampling bias in order to get closer to the ‘truth.’ My guess is that the election analysts will eventually figure out where they went wrong. After all, they’ve got three years to work it out before the next presidential race starts. Heck, they probably started slogging through all the data the day after the election!
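One common kind of compensation factor is reweighting a sample so that each group counts in proportion to its known share of the population. The sketch below applies that idea to the landline versus cell-only split mentioned earlier. It is a simplified illustration, not a description of what any pollster actually did: the group shares and support levels are invented numbers, and `reweight_estimate` is a hypothetical helper.

```python
def reweight_estimate(support_by_group, sample_shares, population_shares):
    """Return (raw_estimate, reweighted_estimate) of overall support.

    support_by_group:  observed support for a candidate within each group
    sample_shares:     fraction of the sample that falls in each group
    population_shares: fraction of the real population in each group
    """
    raw = sum(support_by_group[g] * sample_shares[g] for g in support_by_group)
    reweighted = sum(
        support_by_group[g] * population_shares[g] for g in support_by_group
    )
    return raw, reweighted


if __name__ == "__main__":
    # Hypothetical figures: the poll over-reaches landline households.
    support = {"landline": 0.40, "cell_only": 0.60}
    in_sample = {"landline": 0.80, "cell_only": 0.20}
    in_population = {"landline": 0.50, "cell_only": 0.50}

    raw, reweighted = reweight_estimate(support, in_sample, in_population)
    print(f"raw estimate:        {raw:.2f}")
    print(f"reweighted estimate: {reweighted:.2f}")
```

Reweighting only helps when the true population shares are known and the people reached within each group resemble the people who were not reached. It cannot correct for a factor nobody thought to record.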

One needs to realize that data science is just that, a science, and not something that can simply be stepped into without knowledge of the complexities of the discipline. Attempting to do so without a full understanding of sampling bias, undocumented confounders and a host of other factors will lead you down the path to a wrong conclusion, aka ‘failure’. History has shown that, for ANY science, there are many failed experiments before a breakthrough. Laboratory scientists need to exercise caution and adhere to strict protocols to keep their work from being ruined by outside contaminants. The same goes for data scientists, who continually refine their collection methods and models when experiments fail.

So what about the ‘data science’ efforts for YOUR business? Are you trying to predict outcomes based on limited datasets and rudimentary Excel skills, then wondering why you can’t make any sense out of your analysis models? Do you need help identifying and eliminating sampling bias and accounting for those pesky ‘undocumented confounders’? Social media sentiment analysis is a big buzzword these days, with lots of potential for companies to mix this with their own performance metrics. But many just don’t know how to go about it, or are afraid of the cost.

At BlumShapiro Consulting, our team of consultants is constantly looking at the latest trends and technologies associated with data collection and analysis. Some of the same principles associated with election polling can be applied to your organization through predictive analytics and demand planning. Using Microsoft’s Azure framework we can quickly develop a prototype solution that can help take your organization’s data reporting and predicting to the next level.

About Todd: Todd Chittenden started his programming and reporting career with industrial maintenance applications in the late 1990s. When SQL Server 2005 was introduced, he quickly became certified in Microsoft’s latest RDBMS technology and has added certifications over the years. He currently holds an MCSE in Business Intelligence. He has applied his knowledge of relational databases, data warehouses, business intelligence and analytics to a variety of projects for BlumShapiro since 2011.
