Tag Archive for Data Science

The 4 Machine Learning Problems, Explained

Machine Learning and Predictive Analytics have been receiving a lot of attention lately!  Without question, this is an exciting technology with extremely broad applicability.  After all, who wouldn’t want to be able to predict the future?  Still, with hype comes confusion, and there is a lot of confusion today about what exactly Machine Learning is and how to use it.

I have good news!  There are really only 4 (yes, four) Machine Learning problems.  For anyone who wants to explore the value of Machine Learning, it’s important to understand them, because the first step in any Machine Learning process is to figure out which of these problems you are trying to solve.  Data Science teams address this question before they begin designing a Machine Learning model.  If your problem does not fit into one of these buckets, forget the hype! You’re better off taking a simpler approach.

Classification – in this machine learning problem, we’re trying to figure out whether some bit of data (an observation) represents something we already understand (a Label).  This label can either be a Yes or No decision (Two Class), or it can be one of a set of possible answers (Multi Class).  In order for this to work well, you need to provide the Machine Learning model with labeled examples first.  Applications include:

  1. Facial Recognition – is this picture an image of my customer?
  2. Voice Recognition – what word is represented by this sound?
  3. Handwriting Recognition – which letter in the alphabet does this image represent?
  4. Fraud Detection – is this transaction fraudulent?
  5. Medical Outcomes – will this person have a stroke in the next year?
  6. Proactive Maintenance – will this piece of machinery fail in the next 72 hours?
  7. Credit Default Risk – will this borrower default on his/her loan?
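
To make this concrete, here is a minimal sketch of Two Class classification using a simple k-nearest-neighbor vote.  The transactions, features, and labels below are invented purely for illustration – a real fraud model would use far more data and a trained algorithm:

```python
import math

# Hypothetical labeled examples: (transaction amount, hour of day) -> label
train = [
    ((20.0, 14), "legitimate"),
    ((35.0, 10), "legitimate"),
    ((15.0, 16), "legitimate"),
    ((900.0, 3), "fraudulent"),
    ((750.0, 2), "fraudulent"),
    ((820.0, 4), "fraudulent"),
]

def classify(observation, k=3):
    """Predict a label by majority vote among the k nearest labeled examples."""
    dists = sorted(
        (math.dist(observation, features), label) for features, label in train
    )
    nearest = [label for _, label in dists[:k]]
    return max(set(nearest), key=nearest.count)

print(classify((880.0, 3)))   # resembles the known fraudulent examples
```

Notice that the model is only as good as the examples you feed it – which is why labeled training data comes up again and again in Classification.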

Regression – in this machine learning problem, a Yes or No answer is not going to be enough.  To solve it, the machine needs to predict a value (i.e. a price, a temperature, a measurement) by understanding the numeric relationship of that value to other values (or Factors).  If this sounds like the “Rate of Change” functions you saw in Calculus, you’re on the right track.  Just as with Classification, Regression problems need some examples in order to work well.  Applications include:

  1. Cost Analysis – when will be the best time to buy something?
  2. Demand Prediction – how many widgets will we sell next year?
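
Here is a sketch of the simplest possible Regression: ordinary least squares fit to a single factor.  The spend and sales figures are made up for illustration:

```python
# Hypothetical history: advertising spend (in $1,000s) vs. widgets sold
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sold  = [120, 190, 260, 330, 400]

n = len(spend)
mean_x = sum(spend) / n
mean_y = sum(sold) / n

# Ordinary least squares: the slope is the "rate of change" of sales with spend
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(spend, sold)) \
        / sum((x - mean_x) ** 2 for x in spend)
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

print(predict(6.0))  # projected demand at a spend level we have not yet tried
```

Real demand prediction uses many factors at once, but the idea is the same: learn the numeric relationship from examples, then extrapolate.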

Clustering – this is where things get complicated!  With the first two problems, we have labeled examples we can use to “train” our machines to predict a label AND we can test them against labeled observations (known to Data Scientists as “Ground Truth”).  But what if we don’t have a ground truth?  The best we can do is identify clusters of similar observations.  Fair warning: without ground truth, evaluating the results will be a challenge.  Still, some applications include:

  1. Grouping of Content – Grouping Today’s News into Categories, or Documents into Topics
  2. Materials Classification – take a Raw Materials Master File and organize it into a taxonomy
  3. Customer Segmentation – identify similar customers based upon purchase behavior
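
A tiny customer segmentation sketch shows the idea.  This is a stripped-down k-means: no labels anywhere, just points being pulled into natural groups.  The customer figures and starting centroids are invented for illustration:

```python
import math

# Hypothetical customers: (orders per year, average order value)
customers = [(2, 20), (3, 25), (2, 30), (40, 200), (45, 210), (38, 190)]

def kmeans(points, centroids, rounds=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its assigned points; repeat."""
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

low, high = kmeans(customers, centroids=[(0, 0), (50, 250)])
print(low)   # occasional buyers
print(high)  # frequent, high-value buyers
```

Note what is missing: there is no “right answer” to check the clusters against, which is exactly the evaluation challenge described above.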

Recommender – have you ever been on a website which presented a recommendation of something you might “Like”?  Movie recommendations on Netflix, product recommendations on Amazon, advertisements in your apps – if you have spent any time on the internet, you understand the premise: given what a user has liked before, and what similar users have liked, predict what they will like next.
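
One common approach is collaborative filtering: find the user most similar to you, and suggest something they liked which you haven’t seen.  The users, movies, and ratings below are invented for illustration:

```python
import math

# Hypothetical ratings: one row per user, one column per movie (0 = not seen)
ratings = {
    "alice": [5, 4, 0, 1],
    "bob":   [4, 5, 1, 0],
    "carol": [1, 0, 5, 4],
}
movies = ["Movie A", "Movie B", "Movie C", "Movie D"]

def cosine(u, v):
    """Cosine similarity: 1.0 means identical taste, 0.0 means nothing in common."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def recommend(user):
    """Suggest the unseen movie rated highest by the most similar other user."""
    mine = ratings[user]
    neighbor = max((u for u in ratings if u != user),
                   key=lambda u: cosine(mine, ratings[u]))
    unseen = [i for i, r in enumerate(mine) if r == 0]
    return movies[max(unseen, key=lambda i: ratings[neighbor][i])]

print(recommend("alice"))
```

Production recommenders operate on millions of users and items, but this is the premise in miniature.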

That’s it.  Now you know how to recognize a problem which Machine Learning can help you with.   If your business problem does not fall into one of these four, you don’t need a machine learning model to solve it.  More importantly, if you know the factors which drive a business outcome, just build a model in Excel – you don’t need a Data Science team for that.

Good luck!

Data Science Foundations – Similarity Matching and Clustering

What makes a good Data Scientist?  A good data scientist is a software engineer with a solid background in statistics, or a Statistician who likes to code.  I am a software engineer with a solid background in statistics, so I thought I would share my knowledge of statistics on this blog, focusing on important foundational tasks which every software engineer/statistician needs to know.

In my last post, I introduced the very basics: Classification and Regression.  In this blog post, I want to talk about two methods which are statistical in nature and can also be used in Data Quality exercises: Similarity Matching and Clustering.  Both can be helpful to Data Quality and Data Governance teams who are looking to reduce data duplication, and also to predict correct attribute values in the absence of authoritative data.

Similarity Matching is a foundational task which can support classification and regression activities later.  Here, we are trying to identify similar data members based upon the known attributes of those data members.  For example, a company may use similarity matching to find new customers who closely resemble its very best customers, so they can be targeted for special offers or other customer retention strategies.  Or, a company may look for similarities across raw materials data from vendors in order to optimize costs.
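
The customer retention example above can be sketched in a few lines: rank prospects by their distance to a “best customer” profile.  The attribute values and customer IDs are invented for illustration:

```python
import math

# Hypothetical customer attributes: (annual spend in $1,000s, visits per month)
best_customer = (12.0, 8)
prospects = {
    "cust-101": (11.5, 7),
    "cust-102": (2.0, 1),
    "cust-103": (13.0, 9),
    "cust-104": (5.0, 3),
}

def most_similar(target, candidates, top_n=2):
    """Rank candidates by Euclidean distance to the target profile."""
    return sorted(candidates, key=lambda c: math.dist(candidates[c], target))[:top_n]

print(most_similar(best_customer, prospects))  # candidates for the retention offer
```

The same distance-based matching, applied to record attributes instead of customer behavior, is what lets Data Quality teams flag likely duplicates.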

Clustering is another foundational task, in that it can be preliminary to further exercises.  Clustering attempts to find natural groupings of data entities, without necessarily being driven by a particular purpose.  The results can be an input to later decision making: what products or services should we offer these customers?  Is the population large enough to market to specifically?

In my next post, I’ll continue differentiating data science tasks by character and purpose.  Many tasks are related and so we’ll talk about some which complement others already under discussion.

Data Science Foundations – Classification and Regression

The Big Data journey has to start somewhere.  My observation in talking to Microsoft technologists is that, while Big Data is fascinating and exciting, they don’t know where to start.  Should we start by learning Hadoop?  R? Python?

Before we jump into tools, let’s understand how data science works and what can be gained from it.  By now, you understand that Predictive Analytics (or Machine Learning) is a relatively new branch of Business Intelligence.  Instead of asking how our business/department/employee has been performing (recently, and as compared to historical trends), we are now seeking to predict what will happen in the future, based upon data collected in the past.  We can do this at a very granular level.  We can identify “which thing” will behave “which way”.  Some examples: which customer is likely to cancel their subscription plan, which transactions are fraudulent, which machine on the factory floor is about to fail.

There are several approaches to applying statistics and mathematics to answer these questions.  In this blog post, I will focus on two data science tasks: Classification and Regression.

Classification is used to predict which of a small set of classes a thing belongs to.  Ideally, the classes are a small set and mutually exclusive (Male or Female, Republican or Democrat, Legitimate or Fraudulent).   They need not be “either/or”, but it is easiest to think of them in that manner.

Closely related to Classification is the task of predicting the probability that the thing is classified that way.  This is called Class Probability Estimation.  We can determine that a transaction is “Legitimate” with 72.34% certainty, for example.
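
One simple way to produce such an estimate is to look at the share of each class among an observation’s nearest labeled neighbors.  The transactions and features below are invented for illustration – real systems use calibrated probabilistic models:

```python
import math

# Hypothetical labeled transactions: (amount, hour of day) -> class
train = [
    ((20.0, 14), "legitimate"),
    ((35.0, 10), "legitimate"),
    ((15.0, 16), "legitimate"),
    ((60.0, 2),  "fraudulent"),
    ((900.0, 3), "fraudulent"),
    ((750.0, 2), "fraudulent"),
]

def probability(observation, label, k=5):
    """Estimate P(label) as that label's share among the k nearest examples."""
    dists = sorted((math.dist(observation, t), lab) for t, lab in train)
    nearest = [lab for _, lab in dists[:k]]
    return nearest.count(label) / k

p = probability((30.0, 12), "legitimate")
print(f"{p:.0%} certainty the transaction is legitimate")
```

The point is that the model returns a degree of certainty rather than a bare Yes or No, which lets the business choose its own threshold for acting.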

What can be gained from Classification?  There are many iconic stories of forward-thinking companies anticipating business issues before they arrive – and then taking action.  My favorite is the story of Signet Bank, whose credit card division was unprofitable, due to “bad” customers defaulting on loans and “good” customers being lost to larger financial institutions who could offer better terms and conditions.  The answer, revolutionary at the time, was to apply Classification to their customer data.  They separated the “Bad” from the “Good”, cut the “Bad” ones loose and nurtured the “Good” ones with offers and incentives.  Today, we know that division as Capital One.

Regression, on the other hand, is used to estimate the numeric value of some variable for a given thing.  For example, “How much should I expect to pay for a given commodity?” or “How hot will my home get before a human turns the heat down?”  This is often confused with Class Probability Estimation.  Classification is related to Regression, but they have different goals: Classification determines whether something will happen, while Regression determines how much of something will happen.

What can be gained from Regression?  In manufacturing, it is very useful to understand how much use a particular machine part should be expected to deliver, before performance degrades below an acceptable tolerance level.  Any financial services firm does this routinely to price securities and options.

In my next blog, I will discuss other data science tasks which are related to “Customers who bought this, also bought that”.