First Accommodate Master Data, Then Clean It

In this blog post, I want to challenge a deeply held notion of Data Quality and Master Data Management.

I have had many, many conversations with technology professionals seeking to implement MDM in their organization. In those first conversations, among the first questions asked is a complex one, disguised as a simple one – How can I start with clean data?

Listen: if you try to start your Master Data implementation with clean master data – you will never get started!

Instead, you need to embrace two fundamental realities of Master Data Management. First, there is no clear authoritative source for your master data (if there were, you wouldn’t have a problem). Second, Data Quality is “Front of House” work. The IT department may have data integration, data profiling, third-party reference data and matching algorithms in its toolbox, but it can only do so much. IT tools are Back Office tools, and IT data cleanups happen in the shadows. When they get it wrong, they get it comprehensively wrong (and the explanation is hard to understand).

The following sequence is straightforward, enables the business to take ownership and provides a clear path to getting started.

  • Accommodate your Data – for business people to steward and govern their own data, they need to see it with their own eyes, and they need to see all of it, even the data they don’t like. In order to do this, you must:
    • Maintain a clear relationship between data in the MDM hub and its source – don’t attempt to reduce the volume of records. The Federated approach to MDM does this best.
    • Keep rationalization/mapping to a minimum – avoid cleaning the data as you load it. It’s wasteful to do it in ETL code when your MDM toolset is ready to do it for you much more easily.
    • Take a “Come as You Are” approach – avoid placing restrictions on the data at this stage of the project, because this only serves to keep data out of your system. We want the data in.
  • Establish Governance of your Data – once you have all of the data loaded into a Federated data model, you have the opportunity to start addressing the gaps.
    • First, take some baseline measurements. How incomplete is your data?
    • Next, begin developing rules which can be enforced in the MDM Hub. These rules should be comprehensible to a business user. Ideally, your toolset integrates the rules into the stewardship experience, so that rules declared in the hub are readily visible to data stewards. Once you have a healthy body of rules, validate the data and take another baseline measurement.
    • Now your data stewardship team can get to work, and you’ll have real metrics to share with the business regarding the progress you are making towards data compliance.
  • Improve your Data – MDM toolsets automate the process of improving master data sourced from different systems. They do this in three ways:
    • Standardize your Data – MDM tools help data stewards establish standards and check data against those standards
    • Match your Data – MDM tools help data stewards find similar records from multiple systems and group them into clusters. The Group becomes the “Golden Record” – none of the sources get to be the boss! (A minimal matching sketch follows this list.)
    • Harmonize your Data – MDM tools help data stewards make decisions about which sources are most authoritative and can automate the harmonization of data within a grouping
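
To make the matching step concrete, here is a minimal sketch of grouping similar records from several source systems into golden-record clusters. The record layout, the similarity threshold and the use of a simple string-similarity measure are all assumptions for illustration; real MDM toolsets use far more sophisticated matching.

```python
# Toy illustration of the "Match your Data" step: cluster similar customer
# records from multiple sources into golden-record groups. The records,
# threshold and name-only comparison are assumptions for the example.
from difflib import SequenceMatcher

records = [
    {"source": "CRM", "id": "C-101", "name": "Acme Corporation"},
    {"source": "ERP", "id": "E-555", "name": "ACME Corp."},
    {"source": "Web", "id": "W-9",   "name": "Acme Corp"},
    {"source": "CRM", "id": "C-202", "name": "Globex Industries"},
]

def similar(a, b, threshold=0.65):
    """Return True when two names look alike enough to be the same entity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

clusters = []  # each cluster is one golden-record group
for rec in records:
    for cluster in clusters:
        if any(similar(rec["name"], member["name"]) for member in cluster):
            cluster.append(rec)   # joins an existing group
            break
    else:
        clusters.append([rec])    # starts a new group; no source "wins"

for i, cluster in enumerate(clusters, start=1):
    print(f"Golden record group {i}: {[(r['source'], r['id']) for r in cluster]}")
```

The point of the sketch is the shape of the outcome: the group, not any single source record, becomes the golden record.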

Organizations whose starting approach with MDM is “Get the data clean and Keep the data clean” often fail to even get started. Or worse, they spend a lot of time and money requiring IT to clean the data, and then abandon the project after 6 months with nothing to show for it. Clean, then Load is the wrong order: Flip the Script and stick to these principles.

  1. Design a Federated MDM Data model which simplifies identity management for the master data.
  2. Identify where your master data lives and understand the attributes you want to govern initially.
  3. Bring the master data in as it exists in the source systems.
  4. Remove restrictions on loading your data.
  5. Establish some baseline measurements.
  6. Devise your initial rule set (steps 5 and 6 are sketched in code after this list).
  7. Use MDM Stewardship tools to automate standardizing, matching and harmonizing.
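
As a rough illustration of steps 5 and 6, here is a hypothetical sketch that takes a completeness baseline and checks one business-readable rule. The field names, sample records and the rule itself are invented for the example; your MDM hub would hold the real versions.

```python
# Hypothetical sketch of steps 5 and 6: take a completeness baseline, then
# check one business-readable rule. Fields, records and the rule are invented.
customers = [
    {"name": "Acme Corporation", "country": "US", "tax_id": "12-3456789"},
    {"name": "Globex Industries", "country": "", "tax_id": None},
    {"name": "Initech", "country": "DE", "tax_id": ""},
]

def completeness(records, field):
    """Percentage of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field))
    return 100.0 * filled / len(records)

# Step 5: baseline measurements
for field in ("name", "country", "tax_id"):
    print(f"{field}: {completeness(customers, field):.0f}% complete")

# Step 6: a first rule a business user can read:
# "Every customer must have a country code."
violations = [r["name"] for r in customers if not r.get("country")]
print("Rule 'country required' violations:", violations)
```

Re-running the same measurements after stewardship work gives you the before-and-after metrics to share with the business.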

The Business Value of Microsoft Azure – Part 4 – Virtual Machines

This article is part 4 of a series of articles that focus on the Business Value of Microsoft Azure. Microsoft Azure provides a variety of cloud-based technologies that can enable organizations in a number of ways. Rather than focusing on the technical aspects of Microsoft Azure (there’s plenty of that content out there), this series focuses on business situations and how Microsoft Azure services can help.

In our last article we focused on data loss from the standpoint of system failure, corruption or other disasters that require access to a backup. Today, and I’m surprised it took me until part 4 to get to it, we’re going to focus on Virtualization. One of the simplest and most common Infrastructure as a Service (IaaS) solutions is virtualization, or the creation of virtual machines in a cloud infrastructure.

Take, for example, the story of a local town. They have an ERP system that is currently running on Windows Server 2003. The nature of the application is such that it runs best through Remote Desktop (RDP), which is how both their local and remote users access the system. Like many towns, they were wrestling with the best path forward for their infrastructure. Here are a few characteristics that defined them:

  1. They had an older ERP system that they needed to upgrade because it was running, and only supported, on Windows Server 2003, which reaches end of life in July 2015.
  2. They weren’t certain the next version of their ERP system was where they wanted to be in the long run, but they hadn’t yet found a suitable replacement.
  3. Any investment in hardware/software to support the new ERP would therefore need to be questioned, given that they might only run it for another year.
  4. They had several locations throughout the town that all connected over RDP to access the ERP system because it was not designed to run well as a client/server system over a WAN.

After an initial assessment it was determined that their existing infrastructure would not be able to support the new environment. A local IT vendor quoted them approximately $50,000 in hardware and software to create a new virtual server environment on-premise. Were it not for the ERP upgrade requirements, their existing hardware and software would have remained sufficient for a number of years. $50,000 is a significant amount given that the town isn’t sure it will stick with the ERP system. What else could this town do?

Enter Microsoft Azure Virtual Machines

To give the town some breathing room on a possible switch to a different ERP system, and to ease the pressure to upgrade their on-premise infrastructure, the town turned to Microsoft Azure. Using the virtualization capabilities of the Azure platform, the town created a new RDP environment along with the ERP server and database. This solution, which the town connected to their existing environment with the site-to-site VPN capabilities of Azure, provided a secure, reliable and easily expandable environment to meet their needs.

The key benefits of this approach were as follows:

  1. Eliminated $50,000 of up-front cost for revamping their existing hardware and shifted them to a reasonable $1,000/month Azure subscription model (see the rough break-even arithmetic after this list)
  2. Avoided a sunk cost should the town decide to move to a different ERP solution, perhaps one that follows a SaaS model. With Azure, if a set of services is no longer needed you simply turn them off and you don’t get billed further.
  3. Allowed them to continue to get life out of their existing on-premise infrastructure
  4. Established a pattern that could be followed for other applications: the town now had the option to quickly and easily add virtual machines to their Azure subscription to support other workloads.
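
As rough, back-of-the-envelope arithmetic using only the figures quoted above (and ignoring power, cooling, administration and eventual hardware refresh on the on-premise side), the subscription takes years to add up to the quoted capital cost:

```python
# Back-of-the-envelope break-even using the figures quoted in this article.
# A real comparison would also include power, cooling, admin time and
# hardware refresh on the on-premise side.
upfront_on_premise = 50_000   # quoted hardware/software cost, in dollars
monthly_azure = 1_000         # quoted Azure subscription cost, per month

breakeven_months = upfront_on_premise / monthly_azure
print(f"Break-even after roughly {breakeven_months:.0f} months "
      f"(about {breakeven_months / 12:.1f} years)")
```

For a system the town might abandon within a year or two, paying month to month is clearly the lower-risk option.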

Before your town or business elects to go down the traditional path of investing 10s of thousands of dollars in a new on-premise infrastructure, take a look at Microsoft Azure for your virtualization needs.

As a partner with BlumShapiro Consulting, Michael Pelletier leads our Technology Consulting Practice. He consults with a range of businesses and industries on issues related to technology strategy and direction, enterprise and solution architecture, service oriented architecture and solution delivery.

5 Reasons to Keep CRM and MDM Separate

In previous articles, I have identified 5 Critical Success Factors for Initiating Master Data Management at your organization, and delved more deeply into the first of these: the creation of a new system which intentionally avoids creating another silo of information.  The second critical success factor is to recognize that MDM tools work best when kept separate from the sources of master data.  A prime example of this is CRM.  Customer Relationship Management (CRM) solutions are often a key topic in my discussions with clients, chiefly with respect to a proposed Customer MDM solution.   I’m going to use CRM to demonstrate why organizations fail to implement Data Governance when they elect to integrate MDM processing into an existing operational system.

It can be enticing to think of CRM as a good place to “do” Customer MDM.   Modern CRM systems are built from the ground up to promote data quality.  Modern CRM solutions have an extensible data model, making it easy to add data from other systems.    Customer data often starts in CRM: following the “Garbage In, Garbage Out” maxim, it seems important to get it right there first.  Finally, software vendors often claim to have an integrated MDM component, for Customers, Products, Vendors and others.

But here are the problems this approach creates:

  1. More than Data Quality – if an operational system like CRM can offer address standardization, third-party verification of publicly available data, or de-duplication of records, then you should leverage these services. But keep in mind – these services help you achieve quality data for the purpose of making operations in that system run smoothly. If you have only one operational system, then you probably have no need for MDM. If you have more than one, and you pick one as the winner, you’ll tie MDM and that system together very closely, making future integrations extremely daunting.
  2. Data Stewardship Matters – Data Stewardship refers to a role in the organization responsible for the maintenance and quality of data required throughout the organization. In a well-designed Data Governance Framework, data stewards report to the governance team. It’s not always possible for an organization to have dedicated data stewards; more often, “Data Steward” is one role added to operational responsibilities. Now, I would love to tell you that CRM users care about data quality; many of them do. But sales professionals are often focused on the data they need to close a deal, not the myriad other pieces of information needed to truly drive customer engagement. Asking them to be responsible for all of that data sets the organization up for failure.
  3. Governors Don’t Play Favorites – an MDM system should have the ability to store and represent data as it actually exists and is used in ANY source of master data. Without this, your data stewardship team cannot really see the data. If you insist on making CRM the source for master data, your technology team will spend all of their time mapping and normalizing data to match what CRM needs and wants. This is a waste of time. The Federated MDM model is designed to move data in quickly and show data stewards how things really look. Then, and only then, can decisions be made (and automated) about which systems adhere most closely to Enterprise Standards for quality.
  4. Information Silo or Reference Data Set – CRM meets the definition of an Information Silo: it has its own database, and it invents its own identities for Customers, Accounts, Leads, etc. What happens when an account must be deactivated or merged with another account in order to streamline operational processes? Well, if any systems are using CRM as their Reference Data Set, you will have massive problems.
  5. Present at Creation – you probably realize that there are lots of sources of Customer Data, some the business likes to talk about, and some they don’t. I like to separate the two into Sanctioned and Unsanctioned Master Data. Unlike Sanctioned Master Data, which lives in CRM, ERP and other operational systems managed by IT, Unsanctioned Master Data lives in spreadsheets, small user databases (e.g. Microsoft Access) or even websites. This may surprise you – unsanctioned master data is often the most valuable data in the governance process! It is where your analysts and knowledge workers store important attributes and relationships about your customers, and it is the source of real customer engagement. MDM needs to make room for it.

One of the most common misconceptions about how to build an MDM system is the idea that Master Data Management can be best achieved by maintaining a Golden Record in one of many pre-existing operational systems.  This can be a costly mistake and sink your prospects for achieving Data Governance in the long term.  A well implemented Master Data Management system has no operational process aim other than high quality master data.  It must take this stance in order to accept representations of Master Data from all relevant sources.  When this is accomplished, it creates a process agnostic place for stewardship, governance and quality to thrive.

Fun with Machine Learning

All my blog articles over the years have been technical in nature. I decided to break out of that mold today. I almost titled this article “It’s not a train robbery, it’s a science experiment” (Doc Brown, in Back to the Future III). I hope you enjoy reading it as much as I did writing it.

The title is not meant to imply that machine learning isn’t inherently fun (I personally happen to think it’s a cool use of aggregated technologies). Rather, it’s to say that we’re going to have some fun with machine learning in a way you wouldn’t have otherwise considered. But in order to do so, the reader must understand at least the fundamental concepts of machine learning. Don’t worry, we’re not going to be diving into data mining algorithms or the R language or Python code or anything remotely technical. Instead, a real-life analogy is best, and we’ll dumb this one right down to the level of a two-year-old toddler! Kids between the ages of about one and six are GREAT at ‘machine learning,’ but NOT on the LEARNING side of machine learning. No, they’re on the TEACHING side of machine learning, the ‘writing of the algorithms’, the ‘Python and R code’, that the ‘machines’ (their parents) use to learn. Let’s take a look at how this works.

Ever try to get a two-year-old to eat something he or she just does NOT want to eat? Like broccoli or cauliflower? Even adults are split about evenly on the likes and dislikes of vegetables. Two-year-olds, on the other hand, tend to swing to the dislike side on just about all varieties. So what happens? The child absolutely will not eat said vegetables. Babies and toddlers being spoon-fed from a jar tend to take a different and sometimes visually humorous approach: they let you spoon it into their mouth, but it quickly comes back out like toothpaste accompanied by a grimace. Having wasted an entire jar of baby food on the bib, the father (as a new father I had to take my turn feeding the kids!) turns to his wife and says, “Honey, he doesn’t like the green beans, but he loves the applesauce.” “OK,” comes the reply, “I won’t buy the beans again.”

What just happened here? Believe it or not, that was “machine learning” on a micro scale. The ‘machine’, the parents, just ‘learned’ something. Two data points, in fact. Green beans are icky, while applesauce gets a ‘thumbs up.’ Now if all the toddlers in the town were to teach those various bits of knowledge to their respective parents, you would have built yourself a ‘reference dataset.’ Suppose now that a bunch of those mothers interact at the weekly “Mommy and Me” group. Just joining them is a new mother whose daughter is ready to switch from the bottle to semi-solid food. The discussion is likely to turn to what each child likes and dislikes in that area. The new mother listens intently and comes away with knowledge of what her daughter is MOST LIKELY to prefer, but WITHOUT actually having to experience a bib full of pureed sweet potatoes! This is machine learning in action. The machine has applied an algorithm to a reference dataset to predict a probable outcome.

Now, no child dislikes ALL foods, not even three- and four-year-olds, however much some parents perceive otherwise. (My six-year-old son wouldn’t eat a peanut butter and honey sandwich unless it was cut diagonally! Go figure!) If you think your child dislikes ALL foods, it’s more likely he or she only dislikes all the foods YOU like. Since you’re not likely to buy stuff you personally wouldn’t eat, the child has no chance to find what he or she actually enjoys. The parents will then broaden the variety to find something acceptable.

Let’s take a look at another real world scenario, this time closer to the topic at hand.

Many on-line retailers use machine learning and data mining to present to the consumer the things they are MOST LIKELY to purchase based on any number of information points and reference datasets. These include your past purchases, your demographics, and the things other consumers have purchased together. The algorithms employed can be ‘market basket analyses’, ‘clustering’, or others (and I promise that’s as technical as we’ll get in this article). We’ve all seen it in action at Amazon and Netflix. “Based on your viewing history…” or “People who bought X also bought…” Even grocery stores learned that beer was often purchased in conjunction with diapers. It seems that young mothers often sent their husbands to the store in times of diaper need, hence the beer.
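
For the curious, here is a tiny, made-up sketch of the co-purchase counting idea behind “people who bought X also bought…”. The baskets, item names and simple counting approach are all invented for illustration; real retailers use far richer models.

```python
# Toy illustration of "people who bought X also bought Y": count how often
# items appear together in past baskets and suggest the most frequent
# co-purchases. Baskets and item names are made up for the example.
from collections import Counter
from itertools import combinations

baskets = [
    {"diapers", "beer", "wipes"},
    {"diapers", "beer"},
    {"diapers", "wipes", "formula"},
    {"beer", "chips"},
]

co_counts = {}  # item -> Counter of items bought alongside it
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts.setdefault(a, Counter())[b] += 1
        co_counts.setdefault(b, Counter())[a] += 1

def also_bought(item, top_n=2):
    """Items most often purchased together with the given item."""
    return [other for other, _ in co_counts.get(item, Counter()).most_common(top_n)]

print("People who bought diapers also bought:", also_bought("diapers"))
print("People who bought beer also bought:", also_bought("beer"))
```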

I decided to try an experiment this morning, and this is where the fun comes in. I wanted to take a finicky two-year-old’s stance on my internet streaming audio. Pandora, Rhapsody, iHeartRadio and the like often apply machine-learning-style logic to decide the next song to queue in your personal listening stream based on your likes and dislikes. What would happen if I started a new ‘radio station’, then flagged every song it presented to me as ‘thumbs down’? Would it just keep letting me spit out the offerings until it found something I actually liked? What if I didn’t like ANYTHING? Would it cut me off and kick me out for being impossible to please? I decided I just had to find out.

I started by naming my new station “Billy Joel.” (Hey, if the experiment were to fail, I figured why not fail with something decent!) Within 5 seconds of starting the first song, I had hit the ‘Thumbs down’ button. OK, no problem, it moved on to the next. Six more songs were dispatched in similar fashion. “Hey, this is fun,” I thought. On the next song, however, it allowed me to dislike it, but I was forced to listen to the entire song while a banner displayed a message about not being fed that particular vegetable variety again. Five more disliked songs all brought up the same message while still playing the song to completion. Oh well, at least I had some good music to listen to. After a dozen similar results, and realizing I wasn’t getting anywhere trying to fool the machine, I threw it a curve and hit the ‘Thumbs up’ on the next few tracks. (I think I smelled smoke coming from my router.) The next six tracks were all skipped by flagging them as disliked, in similar fashion to the first batch. I settled into a back-and-forth of liking and disliking songs in groups. In the end, I had to like at least a couple of songs it presented to me before I could dislike AND SKIP a bunch of other tracks.

After two hours the machine won, as I had to produce some useful work at the office. There was a practical limit to how much it could ‘learn’ from this picky two-year-old music consumer. Likewise, parents all think they win in the end, too, or do they? They will tell you they eventually ‘got their child to like’ certain foods when in fact they simply settled on a repertoire of foods that their child wouldn’t reject, kind of like…wait for it…machine learning.