Data Science Foundations – Similarity Matching and Clustering

What makes a good data scientist?  A good data scientist is either a software engineer with a solid background in statistics or a statistician who likes to code.  I am a software engineer with a solid background in statistics, and in this blog I want to share that statistical knowledge, focusing on the foundational tasks every software engineer and statistician needs to know.

In my last post, I introduced the very basics: Classification and Regression.  In this post, I want to talk about two methods that are statistical in nature and can also be used in Data Quality exercises: Similarity Matching and Clustering.  Both can help Data Quality and Data Governance teams reduce data duplication, and they can also be used to predict correct attribute values in the absence of authoritative data.

Similarity Matching is a foundational task that can support classification and regression activities later.  Here, we are trying to identify similar data members based upon their known attributes.  For example, a company may use similarity matching to find new customers who closely resemble its very best customers, so they can be targeted for special offers or other customer retention strategies.  Or a company may look for similarities in data across raw materials from vendors in order to optimize costs.

Clustering is another foundational task, in that it can be preliminary to further exercises.  Clustering attempts to find natural groupings of data entities, without necessarily being driven by a particular purpose.  The results can be an input to decision making and to machine learning: what products or services should we offer these customers?  Is the population large enough to market to specifically?

In my next post, I’ll continue differentiating data science tasks by character and purpose.  Many tasks are related, so we’ll talk about some that complement the ones already under discussion.

Adding User Configurations to an Analysis Server Cube

Part 4: Default Configurations with Overrides

In the first parts of this article, we looked at a few methods of getting user-controlled configuration values into a cube such that a value can be changed simply by changing a row in a table. (I could have easily labeled these as “data driven solutions” but feel that term is grossly overused these days.) All the methods outlined previously could suffice if the requirements allowed, but eventually the requirements are going to overtake the capabilities of the design and you will be left with some rework.

To add some complexity to our requirements, the business users have handed down the following:

  1. A default configuration value
  2. A place to enter override values, but ONLY for the date upon which the new value is to take effect
  3. Ability to easily change the effective date of a particular value
  4. Ability to revert to the default value on a particular date
  5. And just to make it interesting, the granularity needs to be at the Fiscal Week level, not at the Date level.

In this fourth and final installment of this series, we’ll meet all the requirements above by pulling out some fancy T-SQL. We used the PIVOT clause in the last solution. This one will add the LEAD windowing function.

The method outlined in Part 3 worked to a point, but if the single value was ever changed, it would change it for all of time. That falls short of our new requirements. Explained in a little more detail: what we need is a way to enter a starting (default) value (#1), and then any new values. Also, the lazy business users only want to have to enter a new value ONCE and have THAT value become the ‘new default’, remaining in effect until changed again (#2). Item #3 says they want a way to, in essence, slide the effective date forward or backward in time. They also want a way to revert to the default, whatever that default happens to be, without even knowing the original default value (#4). And finally, they want it at the Fiscal Week level, not the Date or Day level (#5).

If you missed it in Part 2, we will be using a Configuration table and an abbreviated Date dimension table:
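If you are rebuilding these from scratch, a minimal sketch of the two tables might look like this. Apart from dimDate, DateKey, FiscalWeekKey, and ConfigurationID, the table and column names are placeholders I will use throughout this post; your actual names may differ:

    -- Sketch only: names and data types are placeholders.
    CREATE TABLE dbo.CubeConfiguration
    (
        ConfigurationID    int           NOT NULL PRIMARY KEY,
        ConfigurationName  varchar(50)   NOT NULL,
        ConfigurationValue decimal(18,4) NOT NULL   -- this value acts as the default
    );

    CREATE TABLE dbo.dimDate
    (
        DateKey       int  NOT NULL PRIMARY KEY,   -- e.g. 20131007
        FullDate      date NOT NULL,
        FiscalWeekKey int  NOT NULL                -- e.g. 201341
    );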

 

Next, and new to this solution, we’ll need an override table that joins the Date Key and Configuration ID as follows:
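Something along these lines will do (again, the names are my placeholders). Note that the override value is nullable, which is what will let us revert to the default later on (requirement #4):

    CREATE TABLE dbo.CubeConfigurationOverride
    (
        DateKey         int           NOT NULL,
        ConfigurationID int           NOT NULL,
        OverrideValue   decimal(18,4) NULL,   -- NULL means "revert to the default"
        CONSTRAINT PK_CubeConfigurationOverride PRIMARY KEY (DateKey, ConfigurationID)
    );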

As with any table, we’ll need some sample data:
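For example, for the abbreviated Date dimension (the specific dates and fiscal week numbers below are invented purely for illustration):

    INSERT INTO dbo.dimDate (DateKey, FullDate, FiscalWeekKey)
    VALUES (20131007, '2013-10-07', 201341),
           (20131014, '2013-10-14', 201342),
           (20131021, '2013-10-21', 201343),
           (20131028, '2013-10-28', 201344),
           (20131104, '2013-11-04', 201345),
           (20131111, '2013-11-11', 201346),
           (20131118, '2013-11-18', 201347),
           (20131125, '2013-11-25', 201348),
           (20131202, '2013-12-02', 201349),
           (20131209, '2013-12-09', 201350);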

Note that I am only inserting one day per fiscal week when in reality there would be 7.

Again from Part 2, we’ll add the two values to the main configuration table:
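For example (Configuration ID 2 is the “My Floating Configuration” value with its default of 1.2345, which we will revert to later; the name and value for ID 1 are placeholders):

    INSERT INTO dbo.CubeConfiguration (ConfigurationID, ConfigurationName, ConfigurationValue)
    VALUES (1, 'My Static Configuration',   100.0000),  -- placeholder name and default
           (2, 'My Floating Configuration',   1.2345);  -- default referenced later in this post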

Now add a couple of override values to the override table:
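For example (the values themselves are arbitrary; what matters is that the start dates for IDs 1 and 2 fall in different weeks):

    INSERT INTO dbo.CubeConfigurationOverride (DateKey, ConfigurationID, OverrideValue)
    VALUES (20131014, 1, 110.0000),  -- ID 1 override takes effect in fiscal week 201342
           (20131021, 2,   2.0000),  -- ID 2 override takes effect in fiscal week 201343
           (20131104, 1, 120.0000),  -- ID 1 changes again in fiscal week 201345
           (20131111, 2,   3.0000);  -- ID 2 changes again in fiscal week 201346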

For the sample override values above, I have purposefully staggered the overrides between IDs 1 and 2 so they do not happen in the same week.

 

To bring it all together, we’ll build a view one block at a time. Start with a SELECT statement that uses the LEAD function as follows:
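Here is a sketch of that statement, using the placeholder names from above. Because the override rows are keyed by date, we join to dimDate to translate each row to its fiscal week:

    SELECT
        o.ConfigurationID,
        d.FiscalWeekKey,
        o.OverrideValue,
        LEAD(d.FiscalWeekKey) OVER (PARTITION BY o.ConfigurationID
                                    ORDER BY d.FiscalWeekKey) AS EndWeek
    FROM dbo.CubeConfigurationOverride AS o
    INNER JOIN dbo.dimDate AS d
        ON d.DateKey = o.DateKey;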

In the query above, the [EndWeek] column represents the [FiscalWeekKey] of the NEXT override value for that particular [ConfigurationID], based on the “PARTITION BY ConfigurationID” clause. If that [EndWeek] value happens to be NULL, it means that row is the last entry chronologically for that Configuration ID.

We’ll also need a list of distinct Fiscal Weeks from the dimDate table:
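That one is simple:

    SELECT DISTINCT FiscalWeekKey
    FROM dbo.dimDate;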

Finally, wrapping the above two statements in respective common table expressions, we’ll bring them all together, as shown below. As before, the CROSS JOIN between WEEKS and the Configuration table yields the Cartesian product of weeks and configurations; in our case, 10 weeks and 2 configurations yields 20 rows. But also note the ON clause of the LEFT OUTER JOIN to the [OVR] table: we’re making an equality join on [ConfigurationID] (no surprise there), and the next two lines make sure that each override value is joined to the correct Fiscal Weeks. If there are 4 weeks during which a particular override value is in effect, it is these two AND lines in the join that make it happen.
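A sketch of the combined statement, again using my placeholder names:

    WITH OVR AS
    (
        SELECT
            o.ConfigurationID,
            d.FiscalWeekKey,
            o.OverrideValue,
            LEAD(d.FiscalWeekKey) OVER (PARTITION BY o.ConfigurationID
                                        ORDER BY d.FiscalWeekKey) AS EndWeek
        FROM dbo.CubeConfigurationOverride AS o
        INNER JOIN dbo.dimDate AS d
            ON d.DateKey = o.DateKey
    ),
    WEEKS AS
    (
        SELECT DISTINCT FiscalWeekKey
        FROM dbo.dimDate
    )
    SELECT
        w.FiscalWeekKey,
        c.ConfigurationName,
        ISNULL(o.OverrideValue, c.ConfigurationValue) AS ConfigurationValue
    FROM WEEKS AS w
    CROSS JOIN dbo.CubeConfiguration AS c
    LEFT OUTER JOIN OVR AS o
        ON  o.ConfigurationID = c.ConfigurationID
        AND w.FiscalWeekKey >= o.FiscalWeekKey            -- the override starts on this week...
        AND w.FiscalWeekKey <  ISNULL(o.EndWeek, 999999); -- ...and lasts until the next one (or forever)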

Also note the first ISNULL() function, which falls back to the default value when a week has no override value. To see this in action, add the following row to the override table and re-run the SELECT statement.
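With the placeholder names used above, that row might look like the following; the NULL override value is what causes the ISNULL to fall through to the default:

    INSERT INTO dbo.CubeConfigurationOverride (DateKey, ConfigurationID, OverrideValue)
    VALUES (20131125, 2, NULL);   -- fiscal week 201348: no override value, so the default applies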

Note that as of Week 201348, Configuration ID 2 (“My Floating Configuration”) reverts to the default of 1.2345.

 

And lastly, we need to PIVOT the whole mess by adding the appropriate T-SQL clause at the end:
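One way to do that is to fold the previous SELECT into a third common table expression and pivot on the configuration names. Remember that “My Static Configuration” is a placeholder; list whatever configuration names you actually use:

    CREATE VIEW dbo.vwCubeConfiguration AS
    WITH OVR AS
    (
        SELECT
            o.ConfigurationID,
            d.FiscalWeekKey,
            o.OverrideValue,
            LEAD(d.FiscalWeekKey) OVER (PARTITION BY o.ConfigurationID
                                        ORDER BY d.FiscalWeekKey) AS EndWeek
        FROM dbo.CubeConfigurationOverride AS o
        INNER JOIN dbo.dimDate AS d
            ON d.DateKey = o.DateKey
    ),
    WEEKS AS
    (
        SELECT DISTINCT FiscalWeekKey
        FROM dbo.dimDate
    ),
    CFG AS
    (
        SELECT
            w.FiscalWeekKey,
            c.ConfigurationName,
            ISNULL(o.OverrideValue, c.ConfigurationValue) AS ConfigurationValue
        FROM WEEKS AS w
        CROSS JOIN dbo.CubeConfiguration AS c
        LEFT OUTER JOIN OVR AS o
            ON  o.ConfigurationID = c.ConfigurationID
            AND w.FiscalWeekKey >= o.FiscalWeekKey
            AND w.FiscalWeekKey <  ISNULL(o.EndWeek, 999999)
    )
    SELECT FiscalWeekKey,
           [My Static Configuration],
           [My Floating Configuration]
    FROM CFG
    PIVOT
    (
        MAX(ConfigurationValue)
        FOR ConfigurationName IN ([My Static Configuration], [My Floating Configuration])
    ) AS p;

Apart from moving the earlier SELECT into the CFG expression, nothing above the PIVOT has changed.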

Now, by adding entries to the Override table, users can specify when a new value becomes effective. After an entry is added, editing its date (and therefore its fiscal week) will change when it takes effect.

Recap:

Leveraging an existing Date dimension table, and adding two simple tables for default and override configuration values, we have a view that meets all the requirements. Sliding the effective date of any one configuration override involves simply editing the week it becomes effective, and it will remain in effect until it is overridden again by another entry.

To add a third Configuration value to the cube would involve the following actions:

  1. Adding a row to the Configuration table, and any overrides to the Override table.
  2. Editing the PIVOT clause of the view to include the value as a pivoted column.
  3. Refreshing the cube’s Data Source View to include the new column from the view.
  4. Adding the item as a new measure in the Configuration Measure Group.


Bringing Data to Life with Excel Power Map

If you have spent any time building, designing, viewing, or otherwise interacting with data visualizations, sooner or later you are going to come across Charles Joseph Minard’s iconic graph of Napoleon’s ‘Grande Armée’ as it advances on, and retreats from, Moscow in the fall and winter of 1812.

If you are unfamiliar with the graph, the wide yellow band represents the size of the army at various points along the advance, and the black band represents the same during the retreat, which is accompanied by the temperature plot along the bottom. To put the numbers in perspective, the army started at 400,000 men, reached Moscow with 100,000, and wound up in the end with 10,000.

The graph is stunning in and of itself, especially when one considers that it is about 200 years old. I have kept a post-card sized copy pinned to my cubicle wall over the last 10 years as an example of what can be done with data and the right visualization.

But for all its richness, the graph is still just a two dimensional piece of paper. Playing with Excel Power Map one day, I thought, “What if I got my hands on Minard’s original data? What could I do with it?” The result, which I will walk the reader through creating in this article, is shown below. Even better is the movie clip that can be created from within Excel, which can be seen here.

Now that you have seen the result, I will walk you through how to get there. And of course, it all starts with data. The dataset I pulled off the internet (http://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/minard.txt) looked a little sparse when I first plotted it. I took some liberties with that original data and interpolated many of the points. This was done by taking two adjacent points in time, determining the average army size, latitude, longitude, and date between them, and coming up with a third point halfway between the two. If that did not fill out the graph sufficiently, I took the averages again, between the middle average point and the two original points, in essence creating ‘quarter points’ or ‘third points’ as needed. Purists may argue that I have destroyed the fidelity of the data, and I would argue back that I’m not attempting complex predictive analytics, only trying to ‘pad’ my graph with enough data points such that it closely resembles Minard’s original. I played around with simple circles, the size of each representing the size of the army at that particular point, but settled on the bar graph instead.
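For anyone who would rather script the interpolation than do it cell by cell, the same averaging idea can be expressed in T-SQL. This is purely illustrative: the table dbo.MinardPoints and its column names are my own invention for this sketch, not part of the downloaded file.

    -- Assumes the raw points were loaded into a hypothetical table
    -- dbo.MinardPoints (PointDate date, Latitude decimal(9,6), Longitude decimal(9,6),
    --                   Survivors int, Direction varchar(10)).
    -- One pass generates the midpoint between each pair of adjacent points.
    SELECT
        DATEADD(DAY, DATEDIFF(DAY, PointDate, NextDate) / 2, PointDate) AS PointDate,
        (Latitude  + NextLatitude)  / 2.0 AS Latitude,
        (Longitude + NextLongitude) / 2.0 AS Longitude,
        (Survivors + NextSurvivors) / 2   AS Survivors,
        Direction
    FROM
    (
        SELECT p.*,
               LEAD(p.PointDate) OVER (PARTITION BY p.Direction ORDER BY p.PointDate) AS NextDate,
               LEAD(p.Latitude)  OVER (PARTITION BY p.Direction ORDER BY p.PointDate) AS NextLatitude,
               LEAD(p.Longitude) OVER (PARTITION BY p.Direction ORDER BY p.PointDate) AS NextLongitude,
               LEAD(p.Survivors) OVER (PARTITION BY p.Direction ORDER BY p.PointDate) AS NextSurvivors
        FROM dbo.MinardPoints AS p
    ) AS x
    WHERE NextDate IS NOT NULL;   -- the last point in each direction has no neighbor

Feeding the union of the original points and these midpoints back through the same query produces the quarter points.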

The temperature data was also modified by A) putting all the points on the same latitude and somewhat lower than the army’s path and B) duplicating it in the Fahrenheit scale for those of us not too familiar with the Celsius scale. Because of the up and down nature of the temperature plot, straight line interpolation would not have been appropriate. My resulting data sets of army size and temperatures can be found in the following Excel document. Also of note is the fact that Excel does not recognize dates prior to the year 1900, so all dates in the datasets are moved forward by one century to 1912.

Now to the fun stuff. For this you will need Microsoft Excel 2013 with the Power Map add-in installed (http://www.microsoft.com/en-us/download/details.aspx?id=38395). After installation, activate the add-in by clicking File > Options, select the Add-Ins page, select “COM Add-ins” in the “Manage” combo-box at the bottom, click Go, and enable “Microsoft Power Map for Excel”. Back on your workbook, on the INSERT ribbon click the Map button.

The first thing to do with any mapping exercise is to set the geographic references. On the right side, under Army Stats, check the boxes for Latitude and Longitude, and click Next (not shown). On the next page, drag the Survivors field to the HEIGHT box, the Direction field to the CATEGORY box, and the Date field to the TIME box as shown in red below. At this point, after just three mouse clicks and three drag-and-drop operations, you have a fully interactive map (powered by Bing) with browsing controls as shown in the blue boxes. And we’re just getting started!

Let’s clean some things up. Hover your mouse over the “Tour 1” title and notice the popup. Change the title to “Napoleon’s March.” Click the Layer Manager icon, then the Change Layer Options icon (gear) next to Layer 1. Rename the layer to “Army Statistics” using the edit icon (pencil) next to the layer name (not shown).

Adding the temperature plot is simple as well. From the Home ribbon, click Add Layer. At this point, you will follow the same steps we did for plotting the army, starting with the geographic references of latitude and longitude, but this time the data comes from the temperature fields: plot Temperature as the HEIGHT, Scale as the CATEGORY, and Date as the TIME. The only difference is to select a Clustered Column visualization instead of the default Stacked Column.

Of course, to be true to Minard’s original, we can further modify the plot colors to yellow and black using the Layer Settings. I have also reduced the height and opacity of the temperature plots and changed their colors so that they contrast better with the army plot.

The result is good, but unfortunately it still shows 1812 data plotted on modern-day Europe. We need to fix this. In the Tour editor (left side of the screen), click the Change Scene Options button. Then, on the right side, click Change Map Type. Select New Custom Map; in the dialog box, click the Browse icon and locate this map of Russia 1812.jpg. After applying the custom map, don’t be discouraged by how your data looks; we’ll need to make a few adjustments. Set the X Min and Max values to 19 and 41, the Y Min and Max to 50 and 60, respectively, and the Y Scale to 120. Click Apply to view how the changes affect your map. Make adjustments as necessary until the army plot starts at the Niemen River on the left and ends at Moscow on the right. Your final settings may be different depending on your hardware, resolution, etc. It should be noted that the jpg referenced is a conical projection, and Excel does not allow ‘bending’ the plot to coincide with the curved lines of latitude or converging lines of longitude which are clearly visible on the map. Not much we can do about that. It’s the best map of 1812 Russia I could find on the internet.

(Disclaimer: Custom Maps may not be available in your current version of Power Map. As of September 2014, the feature was only available to Office 365 clients and was not included in the download linked at the start of this article. I don’t know if that has since been rectified.)

The last thing we’re going to do is capture the playback video and add a soundtrack. To set the playback speed, on the Play Axis toolbar at the bottom, click the settings icon. Slide the Speed control at the bottom until the Scene duration reaches just over 60 seconds. Alternatively, use the spinner control to fine-tune the duration in seconds. For the soundtrack, what better music than Tchaikovsky’s iconic 1812 Overture, which, by the way, was written to commemorate Russia’s defense against Napoleon’s invasion, the very campaign plotted here! An excerpt sound clip can be found below, and is the final 61 seconds of the timeless classic.

From the Home ribbon, click Create Video (second button from the left). Choose a quality and click Soundtrack Options. It is fairly straightforward from here. For the video at the start of this article, I selected the middle quality (Computers and Tablets at 720p) and removed the option for looping the soundtrack.

As you can see, using Excel Power Map is easy and can yield exciting visualizations, even with 200-year-old data!

Enjoy!


Data Science Foundations – Classification and Regression

The Big Data journey has to start somewhere.  My observation in talking to Microsoft technologists is that, while Big Data is fascinating and exciting, they don’t know where to start.  Should we start by learning Hadoop?  R? Python?

Before we jump into tools, let’s understand how data science works and what can be gained from it.  By now, you understand that Predictive Analytics (or Machine Learning) is a relatively new branch of Business Intelligence.  Instead of asking how our business/department/employee has been performing (recently, and as compared to historical trends), we are now seeking to predict what will happen in the future, based upon data collected in the past.  We can do this at a very granular level.  We can identify “which thing” will behave “which way”.  Some examples: which customer is likely to cancel their subscription plan, which transactions are fraudulent, which machine on the factory floor is about to fail.

There are several approaches to applying statistics and mathematics to answer these questions.  In this blog post, I will focus on two data science tasks: Classification and Regression.

Classification is used to predict which of a small set of classes a thing belongs to.  Ideally, the classes are a small set and mutually exclusive (Male or Female, Republican or Democrat, Legitimate or Fraudulent).   They need not be “either/or”, but it is easiest to think of them in that manner.

Closely related to Classification is the task of predicting the probability that the thing is classified that way.  This is called Class Probability Estimation.  We can determine that a transaction is “Legitimate” with 72.34% certainty, for example.

What can be gained from Classification?  There are many iconic stories of forward-thinking companies anticipating business issues before they arise – and then taking action.  My favorite is the story of Signet Bank, whose credit card division was unprofitable, due to “bad” customers defaulting on loans and “good” customers being lost to larger financial institutions that could offer better terms and conditions.  The answer, revolutionary at the time, was to apply Classification to their customer data.  They separated the “Bad” from the “Good”, cut the “Bad” ones loose, and nurtured the “Good” ones with offers and incentives.  Today, we know them as Capital One.

Regression, on the other hand, is a task used to estimate the numeric value of some variable for a given thing.  For example, “How much should I expect to pay for a given commodity?” or “How hot will the temperature in my home get before a human turns the heat down?”  Regression is often confused with Class Probability Estimation.  Classification is related to Regression, but they have different goals: Classification is for determining whether something will happen, while Regression is for determining how much of something will happen.

What can be gained from Regression?  In manufacturing, it is very useful to understand how much use a particular machine part should be expected to deliver, before performance degrades below an acceptable tolerance level.  Any financial services firm does this routinely to price securities and options.

In my next blog, I will discuss other data science tasks which are related to “Customers who bought this, also bought that”.