Archive for Todd Chittenden

Do Data Scientists Fear for Their Jobs?

What happened in this last election, November 2016? Or rather, what happened to the analysts in this last election? Just about every poll and news prediction had Hillary Clinton leading Donald Trump by a comfortable margin. In every election I can recall from years past, the number crunchers have been pretty accurate in their predictions, at least on who would win if not on the actual numbers. That turned out not to be the case for the 2016 presidential race.

But this is not the first time this has happened. In 1936, Franklin Delano Roosevelt defeated Alfred Landon, much to the chagrin of The Literary Digest, a magazine that collected two and a half million mail-in surveys, roughly five percent of the voting population at the time. George Gallup, on the other hand, predicted a Roosevelt victory with a mere 3,000 interviews. The difference, according to the article's author, was that the Literary Digest's mailing lists were sourced from vehicle registration records. How did this impact the results? In 1936 not everyone could afford a car, so the Literary Digest sample was not a truly representative sample of the voting population. This is known as sampling bias, where the very method used to collect the data points introduces its own skew into the numbers collected. Gallup's interviews, by contrast, were drawn from a pool much more in line with the voting public.
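
To make the idea concrete, here is a minimal Python sketch of sampling bias. Every number in it is invented for illustration: a huge poll drawn only from a skewed subgroup (car owners) stays wrong no matter how large it gets, while a small but representative sample lands near the truth.

```python
import random

random.seed(42)

# Build a synthetic electorate. Roughly 30% own cars, and car owners lean
# differently from everyone else (all percentages are invented for this demo).
population = []
for _ in range(1_000_000):
    owns_car = random.random() < 0.30
    favors_a = random.random() < (0.40 if owns_car else 0.62)
    population.append((owns_car, favors_a))

true_share = sum(favors for _, favors in population) / len(population)

# A huge poll drawn only from car owners (the Literary Digest mistake)...
car_owners = [favors for owns, favors in population if owns]
biased_poll = random.sample(car_owners, 100_000)

# ...versus a small poll drawn at random from the whole electorate (the Gallup approach).
random_poll = [favors for _, favors in random.sample(population, 3_000)]

print(f"True support for candidate A:   {true_share:.1%}")
print(f"Huge poll, car owners only:     {sum(biased_poll) / len(biased_poll):.1%}")
print(f"Small random poll (n = 3,000):  {sum(random_poll) / len(random_poll):.1%}")
```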

The article cited above also mentions Boston's 'Street Bump' smartphone app "that uses the phone's accelerometer to detect potholes… as citizens of Boston … drive around, their phones automatically notify City Hall of the need to repair the road surface." What a great idea! Or was it? The app was only collecting data from people who a) owned a smartphone, b) were willing to download the app, and c) drove regularly. Poorer neighborhoods were pretty much left out of the equation. Again, an example of sampling bias.

The final case, and not to pick on Boston: I recently heard that data scientists analyzing Twitter feeds for positive and negative sentiment had to factor in the term "wicked" as a positive sentiment signal, but only for greater Boston. Apparently, that adjective doesn't mean what the rest of the country assumes it means.

Along with sampling bias, another driver of erroneous conclusions from data analysis is the 'undocumented confounder.' Suppose, for example, you wanted to see which coffee people prefer, Starbucks or Dunkin' Donuts. For this 'experiment', we're interested only in the coffee itself, nothing else. So we have each shop prepare several pots with varying additions like 'cream only', 'light and sweet', 'black no sugar', etc. We then take these to a neutral location and do a side-by-side blind taste comparison. From our taste results we draw some conclusions as to which coffee the sample population prefers. But unbeknownst to us, when the individual shops prepared their various samples of coffee, one shop used brown sugar and the other used white sugar, or one used half-and-half while the other used heavy cream. The cream and sugar are now both undocumented confounders of the experiment, possibly driving results one way or the other.
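
Here is an equally minimal Python sketch of that confounder at work, again with invented numbers: the two coffees are identical, but one shop's samples carry a sweetener bonus that never gets recorded, so the lift it causes gets credited to the coffee.

```python
import random

random.seed(7)

# Both shops brew the same base coffee (mean taste score 5.0), but Shop A's
# samples were sweetened in a way the tasters happen to prefer. That sweetener
# is never recorded -- it is the undocumented confounder.
def taste_score(sweetener_bonus):
    return random.gauss(5.0, 1.0) + sweetener_bonus

shop_a = [taste_score(sweetener_bonus=0.8) for _ in range(200)]
shop_b = [taste_score(sweetener_bonus=0.0) for _ in range(200)]

print(f"Shop A average taste score: {sum(shop_a) / len(shop_a):.2f}")
print(f"Shop B average taste score: {sum(shop_b) / len(shop_b):.2f}")
# The 'experiment' concludes Shop A brews better coffee, when the only real
# difference was the sweetener nobody documented.
```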

So, back to the elections: how did this year's political analysts miss the mark? Without knowing their sampling methods, I'm willing to suggest that some form of sampling bias or confounder may have played a part. Was it the well-known 'cell-only problem' again (households with no landline are less likely to be reached by pollsters)? Did they take into consideration that Trump used Twitter to deliver sound-bite-style messages to his followers, bypassing the mainstream media's content filters? Some other factor perhaps as yet unidentified? As technology advances and societal trends morph over time, so must political polling and data analysis methods.

Pollsters and data scientists are continually refining their collection methods, compensation factors and models to eliminate any form of sampling bias and get closer to the 'truth.' My guess is that the election analysts will eventually figure out where they went wrong. After all, they've got three years to work it out before the next presidential race starts. Heck, they probably started slogging through all the data the day after the election!

One needs to realize that data science is just that, a science, not something that can simply be stepped into without knowledge of the complexities of the discipline. Attempting to do so without a full understanding of sampling bias, undocumented confounders and a host of other factors will lead you down the path to a wrong conclusion, a.k.a. 'failure'. History has shown that, for ANY science, there are many failed experiments before a breakthrough. Laboratory scientists need to exercise caution and adhere to strict protocols to keep their work from being ruined by outside contaminants. The same goes for data scientists, who continually refine collection methods and models after experiments that fail.

So what about the 'data science' efforts for YOUR business? Are you trying to predict outcomes based on limited datasets and rudimentary Excel skills, then wondering why you can't make any sense of your analysis models? Do you need help identifying and eliminating sampling bias, or accounting for those pesky 'undocumented confounders'? Social media sentiment analysis is a big buzzword these days, with lots of potential for companies to combine it with their own performance metrics. But many just don't know how to go about it, or are afraid of the cost.

At BlumShapiro Consulting, our team of consultants is constantly looking at the latest trends and technologies associated with data collection and analysis. Some of the same principles associated with election polling can be applied to your organization through predictive analytics and demand planning. Using Microsoft's Azure framework, we can quickly develop a prototype solution that can help take your organization's data reporting and prediction to the next level.

About Todd: Todd Chittenden started his programming and reporting career with industrial maintenance applications in the late 1990s. When SQL Server 2005 was introduced, he quickly became certified in Microsoft's latest RDBMS technology and has added certifications over the years. He currently holds an MCSE in Business Intelligence. He has applied his knowledge of relational databases, data warehouses, business intelligence and analytics to a variety of projects for BlumShapiro since 2011.


A Digital Transformation – From the Printing Press to Modern Data Reporting

Imagine producing, marketing and selling a product that has only a four-hour shelf life! After four hours, your product is no longer of much value or relevance to your primary consumer. After eight hours, you would be lucky to sell any of the day's remaining stock. Within 24 hours, nobody is going to buy it; you have to start fresh the next morning. There is such a product line being produced, sold and consumed by millions of people around the world every day. And it's probably more common than you think.

It’s the daily newspaper.

With such a tight production schedule, news printers have always been under the gun to be able to take the latest news stories and turn them into a finished printed product quickly. Mechanization and automation have pretty much made the production of the modern daily paper a non-event, but it has not always been that way.

One hundred and fifty years ago, the typesetter (someone who set your words, or 'type', into a printing press) was the key to getting your printed paper mass-produced. With typesetters working faster than your competitors', you could get your product, your story, out to your consumers sooner, gaining market share. However, it was still very much a manual process. In the late 1800s the stage was set for a faster method of setting type. One such machine, the Paige Compositor, was as big as a minivan and had about 15,000 moving parts. (Samuel Clemens, a.k.a. Mark Twain, invested hundreds of thousands of dollars in the failed invention, contributing to his financial ruin.) On a more personal scale, and at the modern end of the spectrum, we think nothing of sending our finished work, perhaps the big annual report, off to the color printer or 'office machine', or uploading it to a local printing vendor who will print, collate and bind the whole job for us in a fraction of the time it would take a typesetter to lay out even the first page!

So why am I telling you all this? It's certainly not for a history lesson. The point is that the printed news industry went through a transformation from nothing (monks with quill pens), to 'mechanization' (Gutenberg's printing press), to 'automation' and finally to 'digitalization.' And it had to do so as the news consumer evolved from wanting a printed subscription on a monthly basis, down to the weekly, to the daily, and even to the 'morning' and 'evening' editions. Remember, after four hours the product is going stale and is just about useless. (We could debate whether faster technologies were what drove news consumers to want information faster, or whether the needs of the consumer inspired the advancements in technology, but we won't.)

Data and reporting have followed the same phases of transformation, albeit over a much more compressed time span. The modern data consumer is no longer satisfied with having to request a green-bar, tractor-fed report from the mainframe, then wait overnight for the 'job' to get scheduled and run. They're not even satisfied with receiving a morning email report with yesterday's data, or even with being able to pull the latest analytics report from the server farm on demand. No, they want it now, they want it in hand (smartphones), and they want it concise and relevant. Products are popping up to fill this need in today's data reporting market. Products like Microsoft's Power BI can deliver data quickly, efficiently and in the mobile format that this digital transformation demands. Technologies in Microsoft's Azure cloud services, such as Stream Analytics, coupled with Big Data processing, Machine Learning and Event Hubs, have the capability to push data in real time to Power BI. I'll never forget the feeling of elation I had upon completing a simple real-time Azure solution that streamed data every few seconds from a portable temperature sensor in my hand to a Power BI dashboard. It must have been something like what Johannes Gutenberg felt after that first page rolled off his printing press.
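
For the curious, here is a minimal Python sketch of that kind of real-time push. It assumes you have already created a Power BI streaming dataset with two fields, temperature and recorded_at, and copied its push URL; the URL, the field names and the sensor read below are all placeholders, not the exact solution described above.

```python
import json
import random
import time
from urllib import request

# Placeholder push URL copied from the streaming dataset's API settings in Power BI.
PUSH_URL = "https://api.powerbi.com/beta/<workspace>/datasets/<dataset-id>/rows?key=<key>"

def read_sensor():
    # Stand-in for reading a real portable temperature sensor.
    return round(68.0 + random.uniform(-2.0, 2.0), 1)

while True:
    rows = [{
        "temperature": read_sensor(),
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }]
    req = request.Request(
        PUSH_URL,
        data=json.dumps(rows).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    request.urlopen(req)   # dashboard tiles bound to the dataset update within seconds
    time.sleep(5)          # push a new reading every few seconds
```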

Gutenberg and Clemens would be amazed at the printing technology available today to the everyday consumer, yet we seem to take it for granted. Having gone through some of these transformation phases with regard to information delivery myself (yes, I do in fact recall 11×17 green-bar tractor-fed reports), I tend to be amazed at the technologies being developed these days. Eighteen months ago (an eon in technology life) the Apple Watch and Power BI teamed up to deliver KPIs right on the watch! What will we have in another eighteen months? I can't wait to find out.

About Todd: Todd Chittenden started his programming and reporting career with industrial maintenance applications in the late 1990s. When SQL Server 2005 was introduced, he quickly became certified in Microsoft's latest RDBMS technology and has added certifications over the years. He currently holds an MCSE in Business Intelligence. He has applied his knowledge of relational databases, data warehouses, business intelligence and analytics to a variety of projects for BlumShapiro since 2011.


How Much is Your Data Worth?

Data is the new currency of modern business. From the largest international conglomerate down to the smallest neighborhood Mom-and-Pop shop, data is EVERYTHING! Without data, you don't know who to bill for services, or for how much. You don't know how much inventory you need on hand, or who to buy it from if you run out. Seriously, if you lost all of your data, or even a small but vitally important piece of it, could your company recover? I'm guessing not.

“But,” you say, “We have a disaster recovery site we can switch to!”

Sure, if your racks melt down into a pool of heavy metals on the server room floor, then yes, by all means switch over to your disaster recovery site; molten disks certainly qualify as a "disaster!" Databases hosted on private or public cloud virtual machines are less susceptible, but not immune, to hardware failures.

But what about a failure of a lesser nature? What if one of your production databases gets corrupted by a SQL injection attack, cleaned out by a disgruntled employee, or accidentally purged because a developer thought he was working against the DEV environment? Inadvertent changes to data don't care where, or how, that data is stored! And, sorry to say, clustering or other HADR solutions (High Availability/Disaster Recovery, such as SQL Server Always On technology) may not be able to save you in some cases. Suppose some data gets deleted or modified in error. These 'changes', be they accidental or deliberate, may get replicated to the inactive node of the cluster before the issue is discovered. After all, the database system doesn't know it should stop such changes from happening when the command to modify data is issued. How can it tell an 'accidental purge' from regular record maintenance? So the system replicates those changes to the failover node, and you end up with TWO copies of an incorrect database instead of one good one and one bad! Worse yet, depending on the replication latency from your primary site to the disaster recovery site, and how quickly you stop the DR site from replicating, THAT copy may get hosed too if you don't catch it in time!

Enter the DATABASE BACKUP AND RESTORE, the subject of this article. Database backups have been around as long as Relational Database Management Systems (RDBMS). In my humble opinion, a product cannot be considered a full-featured RDBMS unless it can perform routine backups and allows for granular restore to a point in time. (Sorry, but Microsoft Excel and Access simply do not qualify.) Being a Microsoft guy, I'm going to zero in on their flagship product, SQL Server, but Oracle, SAP, IBM and many others offer similar functionality. (See the Gartner Magic Quadrant for database systems for a quick look at the various vendors, including Microsoft, a clear leader in that quadrant.)

So what is a BACKUP? “Is it not simply a copy of the database?” you say, “I can make file copies of my Excel spreadsheet. Isn’t that the same as a backup?” Let me explain how database backups work and then you can decide the answer to that question.

First of all, you'll need the system to create a FULL database backup. This is a file generated by the database server, stored on the file system, in a format proprietary to the database engine. Typically, full backups are taken once per night for a moderately sized database (say, under 100 GB) and should be handled by an automated scheduling service such as SQL Agent.

Next, you'll need TRANSACTION LOG backups. Log backups, as they are known, record every single change in the database that has occurred since the last full or log backup. A good starting point is scheduling log backups at least every hour, tightening that down to every few minutes if the database is extremely active.
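
As a concrete illustration, here is a minimal Python sketch that issues both kinds of backups through pyodbc. The server name, database name and backup paths are hypothetical, and in practice these statements would normally live in scheduled SQL Agent jobs rather than in an ad-hoc script.

```python
import pyodbc

# autocommit=True matters here: BACKUP statements cannot run inside a transaction.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes",
    autocommit=True,
)
cursor = conn.cursor()

# Nightly FULL backup of a hypothetical SalesDB database.
cursor.execute(
    "BACKUP DATABASE [SalesDB] "
    "TO DISK = N'D:\\Backups\\SalesDB_Full.bak' "
    "WITH INIT, CHECKSUM, STATS = 10;"
)
while cursor.nextset():   # drain progress messages so the backup runs to completion
    pass

# Hourly TRANSACTION LOG backup; each one captures every change made since
# the previous full or log backup.
cursor.execute(
    "BACKUP LOG [SalesDB] "
    "TO DISK = N'D:\\Backups\\SalesDB_Log_1300.trn' "
    "WITH CHECKSUM;"
)
while cursor.nextset():
    pass

conn.close()
```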

Now, to restore a database in the event of a failure, there is one very important step: back up the transaction log one last time if you want any hope of restoring to a recent point. To perform the actual restore, you'll need what is known as the 'chain of backups', which includes the most recent full backup and every subsequent log backup. During the restore, you will be able to specify a point in time anywhere from the time of the full backup to the time of the latest log backup, right down to the second or millisecond.
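
Continuing the same hypothetical example, the sketch below walks that chain: a final tail-of-the-log backup, the most recent FULL backup restored with NORECOVERY, then each log backup applied in order, stopping at a point in time just before the bad change hit. The file names and the STOPAT timestamp are placeholders.

```python
import pyodbc

# Same hypothetical server, database and backup paths as in the previous sketch.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes",
    autocommit=True,
)
cursor = conn.cursor()

statements = [
    # 1. One last log backup (the 'tail of the log') to capture work done
    #    since the most recent scheduled log backup.
    "BACKUP LOG [SalesDB] TO DISK = N'D:\\Backups\\SalesDB_Tail.trn' "
    "WITH NO_TRUNCATE, NORECOVERY;",

    # 2. Restore the chain, starting with the latest FULL backup, left
    #    non-operational (NORECOVERY) so log backups can be applied on top.
    "RESTORE DATABASE [SalesDB] FROM DISK = N'D:\\Backups\\SalesDB_Full.bak' "
    "WITH NORECOVERY, REPLACE;",

    # 3. Apply every subsequent log backup in order...
    "RESTORE LOG [SalesDB] FROM DISK = N'D:\\Backups\\SalesDB_Log_1300.trn' "
    "WITH NORECOVERY;",

    # 4. ...and stop the last one at a point in time just before the failure.
    "RESTORE LOG [SalesDB] FROM DISK = N'D:\\Backups\\SalesDB_Tail.trn' "
    "WITH STOPAT = N'2023-01-01T13:42:00', RECOVERY;",
]

for sql in statements:
    cursor.execute(sql)
    while cursor.nextset():   # let each backup/restore run to completion
        pass

conn.close()
```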

So we're all set, right? Almost. The mantra of Database Administrators the world over regarding backups is this: "The backups are only as good and sure as the last time we tested the RESTORE capability." In other words, if you haven't tested your ability to restore your database to a particular point in time, you can't be sure you're doing it right. Case in point: I once saw a backup strategy where the FULL backups were written directly to a tape drive every night; then, first thing in the morning, the IT guys would dutifully eject the tapes and immediately ship them out to an off-site storage location. How can you restore a database if your backups are not available? Case two: the IT guys, not understanding SQL backup functionality and benefits, used a third-party tool to take database backups but didn't bother with the logs. After four years of this, they had a log that was 15 times the size of the database! So big, in fact, that there was no space available to hold its backup. About a year after I got the situation straightened out, with regular FULL and transaction log backups running, the physical server (virtualization was not common practice then) experienced a debilitating hardware failure and the whole system was down for three days. Once running again, the system (a financials software package with over 20,000 tables!) was restored to a point in time right before the failure. Having the daily FULL backups saved the financials system (and the company), and having the log backups as well saved many people the day's work they would have lost had we only been able to go back to the latest FULL backup.
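
Here is a quick sketch of what 'testing the restore' can look like, again in Python with pyodbc and the same hypothetical names: a fast verification of the backup file, followed by the more convincing test of restoring it as a throwaway copy under a different database name. The logical file names in the MOVE clauses are also assumptions you would replace with your own.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes",
    autocommit=True,
)
cursor = conn.cursor()

# Quick sanity check: is the backup file readable and internally consistent?
cursor.execute(
    "RESTORE VERIFYONLY FROM DISK = N'D:\\Backups\\SalesDB_Full.bak' WITH CHECKSUM;"
)
while cursor.nextset():
    pass

# The real test: restore the backup as a disposable copy under a new name.
cursor.execute(
    "RESTORE DATABASE [SalesDB_RestoreTest] "
    "FROM DISK = N'D:\\Backups\\SalesDB_Full.bak' "
    "WITH MOVE N'SalesDB' TO N'D:\\Data\\SalesDB_RestoreTest.mdf', "
    "MOVE N'SalesDB_log' TO N'D:\\Data\\SalesDB_RestoreTest_log.ldf', "
    "RECOVERY;"
)
while cursor.nextset():
    pass

conn.close()
```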

So, what’s your data worth? If your data is critical to your business, it is critical that you properly back up the data. Talk to us to learn how we can help with this.

About Todd: Todd Chittenden started his programming and reporting career with industrial maintenance applications in the late 1990s. When SQL Server 2005 was introduced, he quickly became certified in Microsoft's latest RDBMS technology and has added certifications over the years. He currently holds an MCSE in Business Intelligence. He has applied his knowledge of relational databases, data warehouses, business intelligence and analytics to a variety of projects for BlumShapiro since 2011.


Electrify Your Business with Data

Data is a lot like electricity – a little bit at a time, not too much direct contact, and you’re fine. For example, a single nine-volt battery doesn’t provide enough power to light a single residential bulb. In fact, it would take about a dozen nine-volt batteries to light that single bulb, and it would only last about an hour. It’s only when you get massive amounts of electricity flowing in a controlled way that you can see real results, like running electric motors, lighting up highway billboards, heating an oven or powering commuter trains.

It's the same with data. The sale of a single blue shirt at a single outlet store is not much data. And it's still not much even when you combine it with all the sales for that store in a single day. But what about a year's worth of data, from multiple locations? Massive amounts of data can do amazing things for us as well. We have all seen, in today's data-centric business environment, what controlled usage of data can do.

Some examples include:

  • The National Oceanic and Atmospheric Administration (NOAA) can now predict a hurricane's path more accurately thanks to data that has been collected over time.
  • Marketing firms can save money with culled-down distribution lists based on customer demographics, shopping habits and preferences.
  • Medical experts can identify and treat conditions and diseases much better based on a patient's history, lifestyle risks and other factors.
  • Big 'multiplex' movie houses can predict more accurately the number of theatres they will need to provision for the latest summer blockbuster by analyzing Twitter and other social media feeds related to the movie.

All of this can be done thanks to controlled data analytics.

The key word here is "controlled." With a background in marine engineering and shore-side power generation, I have seen my share of what can happen when electricity and other sources of energy are not kept 'controlled.' Ever see what happens when a handful of welding rods go through a steam turbine spinning at 36,000 RPM and designed to tolerances of thousandths of an inch? It's not pretty. After as many years in database technologies, data analysis and visualizations, I have also seen the damage resulting from large quantities of uncontrolled data. In his book Signal: Discerning What Matters Most in a World of Noise, author Stephen Few shows a somewhat tongue-in-cheek graph that 'proves' that the sale of ice cream is a direct cause of violent crime. Or was it the other way around? It's obvious comic hyperbole, but it serves to illustrate his point: we need to be careful about how we analyze and correlate data.

With the 'big data' explosion, proponents will tell you that 'if a little is good, then more is better.' It's an obvious extension, but is it accurate? Is there such a thing as 'too much data'?

Let’s say you are a clothing retail store in the mall. Having data for all of your sales over the past ten years, broken down by item, store, date, salesperson and any number of other dimensions may be essential. What if we were to also include ALL the sales of ALL competitors’ products, seasonal weather history, demographic changes, foot traffic patterns in the mall and just about anything else that could influence a customer’s decision to buy your product even down to what they had for lunch just before they made the purchase? The result would most likely be UN-controlled data analysis. This tends to lead to erroneous correlations and bad decisions. For instance, you just might discover that customers are four times as likely to make a purchase if they had pizza for lunch, never realizing that there are more pizza restaurants near your stores than any other type of food service!

When it comes to data, stick with what you know is good data. Make sure it's clean, reliable and, above all, relevant. Most of all, make sure you CONTROL your data. Without control, there may be no stopping the damage that results.

About Todd: Todd Chittenden started his programming and reporting career with industrial maintenance applications in the late 1990s. When SQL Server 2005 was introduced, he quickly became certified in Microsoft's latest RDBMS technology and has added certifications over the years. He currently holds an MCSE in Business Intelligence. He has applied his knowledge of relational databases, data warehouses, business intelligence and analytics to a variety of projects for BlumShapiro since 2011.
