The Big Data journey has to start somewhere. My observation in talking to Microsoft technologists is that, while Big Data is fascinating and exciting, they don’t know where to start. Should we start by learning Hadoop? R? Python?
Before we jump into tools, let’s understand how data science works and what can be gained from it. By now, you understand that Predictive Analytics (or Machine Learning) is a relatively new branch of Business Intelligence. Instead of asking how our business/department/employee has been performing (recently, and as compared to historical trends), we are now seeking to predict what will happen in the future, based upon data collected in the past. We can do this at a very granular level. We can identify “which thing” will behave “which way”. Some examples: which customer is likely to cancel their subscription plan, which transactions are fraudulent, which machine on the factory floor is about to fail.
There are several approaches to applying statistics and mathematics to answer these questions. In this blog post, I will focus on two data science tasks: Classification and Regression.
Classification is used to predict which of a small set of classes a thing belongs to. Ideally, the classes are mutually exclusive (Male or Female, Republican or Democrat, Legitimate or Fraudulent). They need not be “either/or”, but it is easiest to think of them in that manner.
Closely related to Classification is the task of predicting the probability that the thing is classified that way. This is called Class Probability Estimation. We can determine that a transaction is “Legitimate” with 72.34% certainty, for example.
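To make the two ideas concrete, here is a minimal Python sketch of Classification and Class Probability Estimation side by side. It is an illustration under stated assumptions, not a production method: the toy transactions, the two pre-scaled features, and the function names are all hypothetical, and the technique (k-nearest neighbors, where the probability is simply the share of the k closest known examples that carry each label) is one simple choice among many.

```python
import math
from collections import Counter

# Hypothetical toy data: each transaction is described by two features that
# are assumed to be already scaled to the 0..1 range, plus a known label.
TRAINING = [
    ((0.10, 0.20), "Legitimate"),
    ((0.20, 0.10), "Legitimate"),
    ((0.15, 0.30), "Legitimate"),
    ((0.30, 0.25), "Legitimate"),
    ((0.90, 0.80), "Fraudulent"),
    ((0.80, 0.90), "Fraudulent"),
    ((0.70, 0.75), "Fraudulent"),
]


def class_probabilities(point, training, k=3):
    """Class Probability Estimation: share of the k nearest examples per label."""
    # Sort known examples by Euclidean distance to the new point, keep the k closest.
    nearest = sorted(training, key=lambda row: math.dist(point, row[0]))[:k]
    counts = Counter(label for _, label in nearest)
    # Report every label seen in training, including those with zero votes.
    labels = {label for _, label in training}
    return {label: counts[label] / k for label in labels}


def classify(point, training, k=3):
    """Classification: pick the single most probable label."""
    probs = class_probabilities(point, training, k)
    return max(probs, key=probs.get)


new_transaction = (0.55, 0.50)
print(classify(new_transaction, TRAINING))
print(class_probabilities(new_transaction, TRAINING))
```

The first function answers “how certain are we?” and the second answers “which class?”. With real data you would want many more examples and carefully chosen features, but the division of labor stays the same.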
What can be gained from Classification? There are many iconic stories of forward-thinking companies anticipating business issues before they arrive, and then taking action. My favorite is the story of Signet Bank, whose credit card division was unprofitable because “bad” customers defaulted on their loans while “good” customers were lost to larger financial institutions that could offer better terms and conditions. The answer, revolutionary at the time, was to apply Classification to their customer data. They separated the “Bad” from the “Good”, cut the “Bad” ones loose and nurtured the “Good” ones with offers and incentives. Today, we know them as Capital One.
Regression, on the other hand, is used to estimate a numeric value for a thing. For example, “How much should I expect to pay for a given commodity?” or “How hot will the temperature be in my home before a human turns the heat down?” This is often confused with Class Probability Estimation, because a probability is also a number. Classification is related to Regression, but they have different goals. Classification is for determining whether something will happen. Regression is for determining how much of something will happen.
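As a rough illustration of Regression, the sketch below fits a straight line to a handful of past purchases and uses it to estimate the price of a new one. Everything here is an assumption made for the example: the weights and prices are invented, the function names are hypothetical, and ordinary least squares with a single input is only the simplest form of Regression.

```python
# Hypothetical past purchases of a commodity: weight in kg and price paid in USD.
WEIGHTS_KG = [1.0, 2.0, 3.0, 4.0, 5.0]
PRICES_USD = [2.1, 3.9, 6.2, 7.8, 10.0]


def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Estimate a numeric value (how much), not a class (whether)."""
    return slope * x + intercept


slope, intercept = fit_line(WEIGHTS_KG, PRICES_USD)
print(f"Estimated price for 6 kg: {predict(6.0, slope, intercept):.2f} USD")
```

Notice that the output is a quantity on a continuous scale. There is no fixed list of answers to choose from, which is the practical difference from the Classification example.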
What can be gained from Regression? In manufacturing, it is very useful to understand how much use a particular machine part can be expected to deliver before its performance degrades below an acceptable tolerance level. Financial services firms routinely use Regression to price securities and options.
In my next blog post, I will discuss other data science tasks, including those behind “Customers who bought this, also bought that”.