Tag Archive for Big Data

A Digital Transformation – From the Printing Press to Modern Data Reporting

Imagine producing, marketing and selling a product that has only a four-hour shelf life! After four hours, your product is no longer of much value or relevance to your primary consumer. After eight hours, you would be lucky to sell any of the day’s remaining stock. Within 24 hours, nobody is going to buy it; you have to start fresh the next morning. Such a product is being produced, sold and consumed by millions of people around the world every day. And it’s probably more common than you think.

It’s the daily newspaper.

With such a tight production schedule, news printers have always been under the gun to turn the latest news stories into a finished printed product quickly. Mechanization and automation have pretty much made the production of the modern daily paper a non-event, but that has not always been the case.

A hundred and fifty years ago, the typesetter (someone who set your words, or ‘type’, into a printing press) was the key to getting your printed paper mass produced. If your typesetters worked faster than your competitors’, you could get your product, your story, out to your consumers faster, gaining market share. However, it was still very much a manual process. In the late 1800s, the stage was set for a faster method of setting type. One such machine, the Paige Compositor, was as big as a minivan and had about 15,000 moving parts. (Samuel Clemens, a.k.a. Mark Twain, invested hundreds of thousands of dollars in the failed invention, leading to his financial ruin.) On a more personal scale and at the modern end of the spectrum, we think nothing of sending our finished work, perhaps the big annual report, off to the color printer or ‘office machine’, or uploading it to a local printing vendor who will print, collate and bind the whole job for us in a fraction of the time it would take a typesetter to lay out even the first page!

So why am I telling you all this? It’s certainly not for a history lesson. The point is that the printed news industry went through a transformation from nothing (monks with quill pens), to ‘mechanization’ (Gutenberg’s printing press), to ‘automation’ and finally to ‘digitalization.’ And it had to do so as the news consumer evolved from wanting a printed subscription on a monthly basis, down to the weekly, to the daily and even to the ‘morning’ and ‘evening’ editions. Remember, after four hours, the product is going stale and is just about useless. (We could debate whether faster technology drove news consumers to want information sooner, or whether the needs of the consumer inspired the advancements in technology, but we won’t.)

Data and reporting have followed the same phases of transformation, albeit on a greatly compressed timeline. The modern data consumer is no longer satisfied with having to request a green-bar, tractor-fed report from the mainframe, then wait overnight for the ‘job’ to get scheduled and run. Nor are they satisfied with a morning email containing yesterday’s data, or even with pulling the latest analytics report from the server farm on demand. No, they want it now, they want it in hand (smartphones), and they want it concise and relevant. Products are popping up to fill this need in today’s data reporting market. Products like Microsoft’s Power BI can deliver data quickly, efficiently and in the mobile format that the industry’s transformation to digital processing demands. Technologies in Microsoft’s Azure cloud, such as Stream Analytics, coupled with Big Data processing, Machine Learning and Event Hubs, can push data to Power BI in real time. I’ll never forget the feeling of elation I had upon completing a simple real-time Azure solution that streamed data every few seconds from a portable temperature sensor in my hand to a Power BI dashboard. It must have been something like what Johannes Gutenberg felt after that first page rolled off his printing press.
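
To give a sense of how little code that last mile can take, here is a minimal sketch, in Python, of pushing a reading to a Power BI streaming dataset over its REST push URL. The URL, field names and values below are placeholders rather than the solution described above; Power BI generates the real push URL when you create a streaming dataset in the service.

```python
# Minimal sketch: push one sensor reading to a Power BI streaming dataset.
# PUSH_URL is a placeholder; Power BI supplies the real one for your dataset.
import time
import requests

PUSH_URL = "https://api.powerbi.com/beta/<workspace-id>/datasets/<dataset-id>/rows?key=<key>"

def push_reading(temperature_f: float) -> None:
    # Power BI expects a JSON array of rows whose fields match the dataset definition.
    rows = [{
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "temperature": temperature_f,
    }]
    response = requests.post(PUSH_URL, json=rows, timeout=10)
    response.raise_for_status()

if __name__ == "__main__":
    # Simulate a sensor emitting a reading every few seconds.
    for reading in (68.2, 68.5, 69.1):
        push_reading(reading)
        time.sleep(5)
```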

Gutenberg and Clemens would be amazed at the printing technology available today to the everyday consumer, yet we seem to take it for granted. Having gone through some of the transformation phases with regard to information delivery myself (yes, I do in fact recall 11×17 green-bar tractor-fed reports), I tend to be amazed at what technologies are being developed these days. Eighteen months ago (an eon in technology life) the Apple Watch and Power BI teamed up to deliver KPIs right on the watch! What will we have in another eighteen months? I can’t wait to find out.

About Todd: Todd Chittenden started his programming and reporting career with industrial maintenance applications in the late 1990s. When SQL Server 2005 was introduced, he quickly became certified in Microsoft’s latest RDBMS technology and has added certifications over the years. He currently holds an MCSE in Business Intelligence. He has applied his knowledge of relational databases, data warehouses, business intelligence and analytics to a variety of projects for BlumShapiro since 2011.

Electrify Your Business with Data

Data is a lot like electricity – a little bit at a time, not too much direct contact, and you’re fine. For example, a single nine-volt battery doesn’t provide enough power to light a single residential bulb. In fact, it would take about a dozen nine-volt batteries to light that single bulb, and it would only last about an hour. It’s only when you get massive amounts of electricity flowing in a controlled way that you can see real results, like running electric motors, lighting up highway billboards, heating an oven or powering commuter trains.

It’s the same with data. The sale of a single blue shirt at a single outlet store is not much data. And it’s still not much even when you combine it with all the sales for that store in a single day. But what about a year’s worth of data, from multiple locations? Massive amounts of data can do amazing things for us as well. We have all seen, in today’s data-centric business environment, what the controlled use of data can do.

Some examples include:

  • The National Oceanic and Atmospheric Administration (NOAA) can now predict a hurricane’s path more accurately thanks to data that has been collected over time.
  • Marketing firms can save money with culled-down distribution lists based on customer demographics, shopping habits and preferences.
  • Medical experts can identify and treat conditions and diseases far better based on a patient’s history, life risks and other factors.
  • Big ‘multi-plex’ movie houses can more accurately predict the number of theatres they will need to provision for the latest summer blockbuster by analyzing Twitter and other social media feeds related to the movie.

All of this can be done thanks to controlled data analytics.

The key word here is “controlled.” With a background in marine engineering and shore-side power generation, I have seen my share of what can happen when electricity and other sources of energy are not kept ‘controlled.’ Ever see what happens when a handful of welding rods goes through a steam turbine spinning at 36,000 RPM and designed for tolerances of thousandths of an inch? It’s not pretty. After as many years in database technologies, data analysis and visualizations, I have also seen the damage resulting from large quantities of uncontrolled data. In his book Signal: Understanding What Matters in a World of Noise, author Stephen Few shows a somewhat tongue-in-cheek graph that ‘proves’ that the sale of ice cream is a direct cause of violent crime. Or was it the other way around? It’s obvious comic hyperbole, but it illustrates his point: we need to be careful about how we analyze and correlate data.

With the ‘big data’ explosion, proponents will tell you that ‘if a little is good, then more is better.’ It’s an obvious extension, but is it accurate? Is there such a thing as ‘too much data’?

Let’s say you run a clothing retail store in the mall. Having data for all of your sales over the past ten years, broken down by item, store, date, salesperson and any number of other dimensions, may be essential. But what if we were to also include ALL the sales of ALL competitors’ products, seasonal weather history, demographic changes, foot traffic patterns in the mall and just about anything else that could influence a customer’s decision to buy your product, even down to what they had for lunch just before they made the purchase? The result would most likely be UN-controlled data analysis, which tends to lead to erroneous correlations and bad decisions. For instance, you just might discover that customers are four times as likely to make a purchase if they had pizza for lunch, never realizing that there are more pizza restaurants near your stores than any other type of food service!
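
To make that pizza trap concrete, here is a toy simulation in Python using entirely made-up numbers: both the pizza lunch and the purchase are driven by a hidden confounder (how many pizza places sit near the store), so the two correlate even though neither causes the other.

```python
# Toy simulation of a spurious correlation driven by a hidden confounder.
import random

random.seed(42)

rows = []
for _ in range(10_000):
    pizza_places_nearby = random.randint(0, 5)                      # hidden confounder
    had_pizza = random.random() < 0.10 + 0.15 * pizza_places_nearby
    made_purchase = random.random() < 0.05 + 0.04 * pizza_places_nearby
    rows.append((had_pizza, made_purchase))

pizza_buy_rate = sum(1 for p, b in rows if p and b) / max(1, sum(1 for p, _ in rows if p))
no_pizza_buy_rate = sum(1 for p, b in rows if not p and b) / max(1, sum(1 for p, _ in rows if not p))

print(f"Purchase rate after a pizza lunch:   {pizza_buy_rate:.2%}")
print(f"Purchase rate without a pizza lunch: {no_pizza_buy_rate:.2%}")
# The gap looks like a causal effect, but controlling for restaurant density would erase it.
```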

When it comes to data, stick with what you know is good data. Make sure it’s clean, reliable and relevant. And most of all, make sure you CONTROL your data. Without control, there may be no stopping the damage that results.

About Todd: Todd Chittenden started his programming and reporting career with industrial maintenance applications in the late 1990s. When SQL Server 2005 was introduced, he quickly became certified in Microsoft’s latest RDBMS technology and has added certifications over the years. He currently holds an MCSE in Business Intelligence. He has applied his knowledge of relational databases, data warehouses, business intelligence and analytics to a variety of projects for BlumShapiro since 2011.

6 Critical Technologies for the Internet of Things

If you and your company prefer Microsoft solutions and technologies, you may fear that the Internet of Things is an opportunity that will pass you by.

Have no fear: Microsoft’s transformation from a “Windows and Office” company to a “Cloud and Services” company continues to accelerate.  Nowhere is this trend more evident than in the range of services supporting Internet of Things scenarios.

So – What are the Microsoft technologies that would comprise an Internet of Things solution architecture?

And – How do Cloud Computing and Microsoft Azure enable Internet of Things scenarios?

Here are the key Microsoft technologies that architects and developers need to understand.

Software for Intelligent Devices

First, let’s understand the Things.  The community of device makers and entrepreneurs continues to flourish, enabled by the emergence of simple intelligent devices.  These devices have a simplified, lightweight computing model capable of connecting machine-to-machine or machine-to-cloud. Windows 10 for IoT, released in July 2015, enables secure connectivity for a broad range of devices on the Windows platform.

Scalable Event Ingestion

The Velocity of Big Data demands a solution capable of receiving telemetry data at cloud scale with low latency and high availability.  This component of the architecture is the “front-end” of an event pipeline that sits between the Things sending data and the consumers of the data.  Microsoft’s Azure platform delivers this capability with Azure Event Hubs – extremely easy to set up and connect to over HTTPS.
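
As a rough illustration of that HTTPS path, here is a hedged Python sketch that signs a Shared Access Signature and posts a single JSON event to an Event Hub’s REST endpoint. The namespace, hub name, policy name and key are placeholders you would take from your own Event Hub’s shared access policy.

```python
# Minimal sketch: send one telemetry event to an Azure Event Hub over HTTPS.
import base64
import hashlib
import hmac
import time
import urllib.parse

import requests

NAMESPACE = "<your-namespace>"
EVENT_HUB = "<your-event-hub>"
POLICY_NAME = "<your-send-policy>"
POLICY_KEY = "<your-policy-key>"

def make_sas_token(uri: str, key_name: str, key: str, ttl_seconds: int = 3600) -> str:
    # Shared Access Signature: HMAC-SHA256 over the encoded resource URI and expiry.
    expiry = str(int(time.time()) + ttl_seconds)
    encoded_uri = urllib.parse.quote_plus(uri)
    to_sign = (encoded_uri + "\n" + expiry).encode("utf-8")
    signature = base64.b64encode(
        hmac.new(key.encode("utf-8"), to_sign, hashlib.sha256).digest())
    return ("SharedAccessSignature sr={}&sig={}&se={}&skn={}"
            .format(encoded_uri, urllib.parse.quote_plus(signature), expiry, key_name))

def send_event(payload: dict) -> None:
    uri = f"https://{NAMESPACE}.servicebus.windows.net/{EVENT_HUB}"
    token = make_sas_token(uri, POLICY_NAME, POLICY_KEY)
    response = requests.post(f"{uri}/messages",
                             json=payload,
                             headers={"Authorization": token},
                             timeout=10)
    response.raise_for_status()

if __name__ == "__main__":
    send_event({"deviceId": "sensor-01", "temperature": 68.4})
```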

Still – Volume + Velocity lead to major complexity when Big Data is consumed; the data may not be ready for human consumption. Microsoft provides options for analyzing this massive stream of “fast data”.  Option 1 is to process the events “in-flight” with Azure Stream Analytics.  ASA allows developers to combine streaming data with Reference Data (e.g., Master Data) to analyze events, defects and “likes”, and to summarize the data for human consumption.  Option 2 is to stream the data to a massive storage repository for analysis later (see The Data Lake and Hadoop).  Regardless of whether you analyze in flight or at rest, a third option can help you learn about what is happening behind the data (see Machine Learning).

Machine Learning

We’ve learned a lot about “Artificial Intelligence” over the past 10 years.  Indeed, we’ve learned that machines “think” very differently than humans.  Machines use principles of statistics to assess which features (“columns”) of a dataset provide the most “information” about a given observation (“row”).  For example, which variables are most predictive of (or most closely correlated with) the final feature of the dataset?  Having learned how the variables relate to one another, a machine can be “trained” to predict the outcome of the next record in the dataset; given an algorithm and enough data – a machine can learn about the real world.
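
Here is a brief, hedged sketch of that idea in Python with scikit-learn, using a bundled sample dataset as a stand-in for your own data: train on past rows, score unseen rows, and ask which columns carried the most information about the outcome.

```python
# Sketch: learn from past records and rank features by how informative they are.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()                      # rows = observations, columns = features
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)                      # "train" on records seen in the past

print("Accuracy on unseen records:", model.score(X_test, y_test))

# Which features told the model the most about the outcome?
ranked = sorted(zip(data.feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```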

If the IoT solution you envision includes predictions or “intelligence”, you’ll want to look at Azure Machine Learning.  Azure ML provides a development studio for data science professionals to design, test and deploy Machine Learning services to the Microsoft Azure Cloud.

Finally, you’ll also want to understand how to organize a data science project within the structure of your company’s overall project management processes.  The term “Data Science” is telling – it indicates an experimental aspect to the process.  Data scientists prepare datasets, conduct experiments, and test their algorithms (written in statistical processing languages like “R” and “Python”) until those algorithms accurately answer, from data, the questions posed by the business.  Data Science requires a balance between experimentation and business value.

The Data Lake and Hadoop

“Data Lake” is a term used to describe a single place where the huge variety of data produced by your big data initiatives is stored for future analysis.  A Data Lake is not a Data Warehouse.  A Data Warehouse has one single structure; data arriving in a variety of formats must be transformed into that structure.  A Data Lake has no predefined structure.  Instead, the structure is determined when the data is analyzed.  New structures can be created over and over again on the same data.

Businesses have the choice of simply storing Big Data in Azure Storage.  If data velocity and volume exceed the limits of Azure Storage, Azure Data Lake offers a specialized storage service optimized for Hadoop, with no fixed limits on file size.  The service was announced in May 2015, and you can sign up for the Public Preview.

The ability to define a structure as the data is read is the magic of Hadoop.   The premise is simple – Big Data is too massive to move from one structure to another, as you would in a Data Warehouse/ETL solution.  Instead, keep all the data in its native format, wait to apply structure until analysis time, and perform as many reads over the same data as needed.  There is no need to buy tons of hardware for Hadoop: Azure HDInsight provides Hadoop-as-a-Service, which can be enabled/disabled as needed to keep your costs low.
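
One way to picture schema-on-read, sketched here in Python with PySpark (one of the engines available on HDInsight clusters): the raw files stay exactly as they landed, and a structure is imposed only at query time. The storage path, schema and column names below are placeholders, not a prescribed layout.

```python
# Sketch of "schema on read": raw JSON stays in place; structure is applied at analysis time.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The same raw files can be read many times, each time with whatever structure
# the analysis of the day requires.
sales_schema = StructType([
    StructField("storeId", StringType()),
    StructField("saleAmount", DoubleType()),
    StructField("saleTime", TimestampType()),
])

sales = spark.read.schema(sales_schema).json("wasb:///raw/sales/2015/*.json")
sales.groupBy("storeId").sum("saleAmount").show()
```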

Real Time Analytics

The human consumption part of this equation is represented by Power BI.  Power BI is the “single pane of glass” for all of your data analysis needs, including Big Data: a dashboard tool capable of transforming company data into rich visuals. It can connect to data sources on premises, consume data from HDInsight or Storage, and receive real-time updates from data “in-flight”.  If you are located in New England, attend one of our Dashboard in a Day workshops happening throughout the Northeast in 2015.

Management

IoT solutions are feasible because of the robust cloud offerings currently available.  The cloud is an integral part of your solution, and you need resources capable of managing your cloud assets as though they were on premises.  Your operations team should be as comfortable turning services on and off in your cloud as they are enabling services and capabilities on a server. Azure PowerShell provides the operations environment for managing Azure cloud services and automating their maintenance and management.

Conclusion

Enterprises ready to meet their customers in the digital world will be rewarded.  First, they must grasp Big Data technologies.  Microsoft customers can take advantage of the Azure cloud to create Big Data solutions: connect Things to the cloud, then create and connect Azure services to receive, analyze, learn from and visualize the data.  Finally, be ready to treat those cloud assets as part of your production infrastructure by training your operations team in Microsoft’s cloud management tools.

Data Science Foundations – Classification and Regression

The Big Data journey has to start somewhere.  My observation from talking to Microsoft technologists is that, while they find Big Data fascinating and exciting, they don’t know where to start.  Should we start by learning Hadoop?  R? Python?

Before we jump into tools, let’s understand how data science works and what can be gained from it.  By now, you understand that Predictive Analytics (or Machine Learning) is a relatively new branch of Business Intelligence.  Instead of asking how our business/department/employee has been performing (recently, and as compared to historical trends), we are now seeking to predict what will happen in the future, based upon data collected in the past.  We can do this at a very granular level.  We can identify “which thing” will behave “which way”.  Some examples: which customer is likely to cancel their subscription plan, which transactions are fraudulent, which machine on the factory floor is about to fail.

There are several approaches to applying statistics and mathematics to answer these questions.  In this blog post, I will focus on two data science tasks: Classification and Regression.

Classification is used to predict which of a small set of classes a thing belongs to.  Ideally, the classes are few and mutually exclusive (Male or Female, Republican or Democrat, Legitimate or Fraudulent).  They need not be “either/or”, but it is easiest to think of them in that manner.

Closely related to Classification is the task of predicting the probability that the thing is classified that way.  This is called Class Probability Estimation.  We can determine that a transaction is “Legitimate” with 72.34% certainty, for example.
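
Here is a minimal, hedged sketch of the two ideas together in Python with scikit-learn, using synthetic data rather than real transactions: the model assigns a class and also reports how certain it is of that class.

```python
# Sketch: Classification plus Class Probability Estimation on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two made-up features per transaction; 1 = fraudulent, 0 = legitimate.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1.5).astype(int)

model = LogisticRegression().fit(X, y)

new_transaction = np.array([[0.3, -1.2]])
label = model.predict(new_transaction)[0]                      # the classification
probability = model.predict_proba(new_transaction)[0, label]   # the probability estimate

print(f"Predicted class {label} with {probability:.2%} certainty")
```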

What can be gained from Classification?  There are many iconic stories of forward-thinking companies anticipating business issues before they arrive – and then taking action.  My favorite is the story of Signet Bank, whose credit card division was unprofitable due to “bad” customers defaulting on loans and “good” customers being lost to larger financial institutions that could offer better terms and conditions.  The answer, revolutionary at the time, was to apply Classification to their customer data.  They separated the “Bad” from the “Good”, cut the “Bad” ones loose and nurtured the “Good” ones with offers and incentives.  Today, we know them as Capital One.

Regression, on the other hand, is used to estimate the numeric value of some variable for a given thing.  For example, “How much should I expect to pay for a given commodity?” or “How hot will the temperature in my home get before a human turns the heat down?” This is often confused with Class Probability Estimation.  Classification is related to Regression, but they have different goals.  Classification is for determining whether something will happen.  Regression is for determining how much of something will happen.

What can be gained from Regression?  In manufacturing, it is very useful to understand how much use a particular machine part should be expected to deliver, before performance degrades below an acceptable tolerance level.  Any financial services firm does this routinely to price securities and options.
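
Here is a short, hedged regression sketch in Python with scikit-learn along the lines of that machine-part example; the numbers are entirely synthetic: fit a model to past observations, then estimate how much wear to expect at a given number of operating hours.

```python
# Sketch: Regression estimates "how much" - here, wear on a part vs. operating hours.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

operating_hours = rng.uniform(0, 5000, size=200).reshape(-1, 1)
wear_mm = 0.002 * operating_hours[:, 0] + rng.normal(scale=0.5, size=200)  # synthetic target

model = LinearRegression().fit(operating_hours, wear_mm)

# "How much" wear should we expect after 3,000 hours of use?
print("Expected wear (mm) at 3,000 hours:", model.predict([[3000]])[0])
```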

In my next blog, I will discuss other data science tasks which are related to “Customers who bought this, also bought that”.