Archive for SQL Server Master Data

T-SQL Inner Join Need Not Always Be an Equal Join

Have you ever had to write a T-SQL query that joined a table to itself? The classic example is an employee table that lists the ID of the person’s supervisor or boss, which refers to another row in the same table. This example is nothing like that (Hint: If you do have to do that, do a search for “recursive hierarchy”). Our example today is something different.

In our example, we’re working from the following scenario: given a list of cities on a map, generate a table that can record the distance between any two. In other words, create a list where every value in the table is paired with every other value. For this exercise, we’re not going to calculate the actual distances, simply create the records and leave the space for someone else to fill in. Sounds simple enough. Let’s start with the simplest case: TWO points, New York and San Francisco.

We could easily create a simple three-column table with the following entries:
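The table itself appeared as an image in the original post; a minimal T-SQL sketch of what it might look like (the dbo.Distances name and its columns are illustrative, not from the original):

    CREATE TABLE dbo.Distances (
        City1    VARCHAR(50) NOT NULL,
        City2    VARCHAR(50) NOT NULL,
        Distance INT NULL   -- left empty for someone else to fill in later
    );

    INSERT INTO dbo.Distances (City1, City2)
    VALUES ('New York', 'San Francisco'),
           ('San Francisco', 'New York');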

But the distance in one direction is the same as the return trip so really, we only need one row. After deleting the redundant record and adding Chicago to our list of cities, we get this:
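Sticking with the illustrative dbo.Distances sketch above, the cleanup and the new city’s pairs might look like this:

    -- Drop the redundant return-trip row
    DELETE FROM dbo.Distances
    WHERE City1 = 'San Francisco' AND City2 = 'New York';

    -- Add the two pairs involving Chicago
    INSERT INTO dbo.Distances (City1, City2)
    VALUES ('New York', 'Chicago'),
           ('San Francisco', 'Chicago');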


As you can see, with the addition of just one value to our list of cities, we need two additional rows in the Distances table. As you might expect, the fourth city in the first table would require three more rows in the second. The fifth city would require four more, and so on. The growth is quadratic. At 100 cities, we would need 4,950 pairs of cities in the distance table, and at 1,000 we would need close to half a million. The actual number is

N × (N − 1) / 2.
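With N = 1,000, for example, that works out to 1,000 × 999 / 2 = 499,500 pairs.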

Time to automate, which is the focus of this article.

Let’s settle on a total of five cities, with the last two being New Orleans and Las Vegas. We could easily build the grid below and fill in any of the appropriate blocks. The black squares are same-city pairs (New York to New York, or Chicago to Chicago, etc.) so we don’t need to worry about them. And if we can get the distance from New York to San Francisco, for example, we don’t need the distance from San Francisco back to New York, so we can also eliminate all the squares marked in gray:

Let’s replace our city names with numbers. After all, any decent database design will have a unique key assigned to each row of the City table.
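Something like the following, with the table name, column names and ID values all being illustrative:

    CREATE TABLE dbo.City (
        CityID   INT IDENTITY(1,1) PRIMARY KEY,
        CityName VARCHAR(50) NOT NULL
    );

    INSERT INTO dbo.City (CityName)
    VALUES ('New York'), ('San Francisco'), ('Chicago'), ('New Orleans'), ('Las Vegas');
    -- CityID values 1 through 5 are assumed in the examples that follow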

And our grid now looks like this:
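(The grid itself was an image in the original post. Rendered roughly in text, with the rows as City 1, the columns as City 2, X marking the same-city squares, a dash marking the redundant return-trip squares and a dot marking the ten pairs we actually need, it looks something like this:)

         1   2   3   4   5
    1    X   .   .   .   .
    2    -   X   .   .   .
    3    -   -   X   .   .
    4    -   -   -   X   .
    5    -   -   -   -   X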


So how do we create this mega list of relationships from ONE table? An INNER JOIN of the City table to itself on the City ID column will generate the squares marked in black. But that is exactly what we DON’T want. What about a CROSS JOIN?
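Using the illustrative dbo.City table sketched above, the two queries in question would look roughly like this:

    -- Equi-join of the table to itself: returns only the same-city pairs (the black squares)
    SELECT c1.CityID AS CityID1, c2.CityID AS CityID2
    FROM dbo.City AS c1
    INNER JOIN dbo.City AS c2
        ON c2.CityID = c1.CityID;

    -- CROSS JOIN: returns every possible combination, 25 rows for 5 cities
    SELECT c1.CityID AS CityID1, c2.CityID AS CityID2
    FROM dbo.City AS c1
    CROSS JOIN dbo.City AS c2;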

That will do it, but with over twice the results we need. Remember that once we know the distance from, say, Chicago to Las Vegas, we don’t also need to record the distance from Las Vegas to Chicago. And we’re still left with some extra rows where the cities are the same, like the distance from New Orleans to New Orleans.

Comparing the City ID columns in the result set above with the City ID axis of the grid, we can see that we’re only looking for pairs (cells) where the ID of City 2 is greater than the ID of City 1. From here, we have two choices: Stick with the CROSS JOIN and add a WHERE clause, or go back to the INNER JOIN and change the ON clause:
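A sketch of the two options, still assuming the illustrative dbo.City table:

    -- Option 1: stick with the CROSS JOIN and filter with a WHERE clause
    SELECT c1.CityID AS CityID1, c2.CityID AS CityID2
    FROM dbo.City AS c1
    CROSS JOIN dbo.City AS c2
    WHERE c2.CityID > c1.CityID;

    -- Option 2: go back to the INNER JOIN and put the inequality in the ON clause
    SELECT c1.CityID AS CityID1, c2.CityID AS CityID2
    FROM dbo.City AS c1
    INNER JOIN dbo.City AS c2
        ON c2.CityID > c1.CityID;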

The results are the same. In fact, the estimated execution plan from the query optimizer is the same as well.

The point of this whole exercise is this: An INNER JOIN need not always be an EQUAL JOIN, though that is by far the most common occurrence. When faced with unique query needs, think outside the box. In this case, simply substituting a GREATER THAN for the EQUAL in the ON clause did the trick.
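To round out the exercise, the same pair list can populate the distance table directly. A sketch, assuming an ID-based version of the table (the dbo.CityDistance name and its columns are illustrative):

    CREATE TABLE dbo.CityDistance (
        CityID1  INT NOT NULL REFERENCES dbo.City (CityID),
        CityID2  INT NOT NULL REFERENCES dbo.City (CityID),
        Distance INT NULL,   -- to be filled in later
        CONSTRAINT PK_CityDistance PRIMARY KEY (CityID1, CityID2)
    );

    INSERT INTO dbo.CityDistance (CityID1, CityID2)
    SELECT c1.CityID, c2.CityID
    FROM dbo.City AS c1
    INNER JOIN dbo.City AS c2
        ON c2.CityID > c1.CityID;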

Learn more about how BlumShapiro Consulting can help with a project like this and additional application development.

About Todd: Todd Chittenden started his programming and reporting career with industrial maintenance applications in the late 1990’s. When SQL Server 2005 was introduced, he quickly became certified in Microsoft’s latest RDBMS technology and has added certifications over the years. He currently holds an MCSE in Business Intelligence. He has applied his knowledge of relational databases, data warehouses, business intelligence and analytics to a variety of projects for BlumShapiro since 2011.

How Much is Your Data Worth?

Data is the new currency in today’s businesses. From the largest international conglomerate down to the smallest neighborhood Mom-and-Pop shop, data is EVERYTHING! Without data, you don’t know who to bill for services, or for how much. You don’t know how much inventory you need on hand, or who to buy it from if you run out. Seriously, if you lost all of your data, or even a small but vitally important piece of it, could your company recover? I’m guessing not.

“But,” you say, “We have a disaster recovery site we can switch to!”

If your racks melt down into a pool of heavy metals on the server room floor, then yes, by all means switch over to your disaster recovery site, because molten disks certainly qualify as a “disaster!” Databases hosted on private or public cloud virtual machines are less susceptible, but not immune, to hardware failures.

But what about a failure of a lesser nature? What if one of your production databases gets corrupted by a SQL Injection hack, cleaned out by a disgruntled employee, or accidentally purged because a developer thought he was working against the DEV environment? Inadvertent changes to data are no respecter of where such data is stored, or how it is stored! And, sorry to say, clustering or other HADR solutions (High Availability/Disaster Recovery, such as SQL Server Always On technology) may not be able to save you in some cases. Suppose some data gets deleted or modified in error. These ‘changes’, be they accidental or on purpose, may get replicated to the inactive node of the cluster before the issue is discovered. After all, the database system doesn’t know whether it should stop such changes from happening when the command to modify data is issued. How can it tell an ‘accidental purge’ from regular record maintenance? So the system replicates those changes to the failover node, and you end up with TWO copies of an incorrect database instead of one good one and one bad! Worse yet, depending on your data replication latency from your primary site to the disaster recovery site, and how quickly you stop the DR site from replicating, THAT copy may get hosed too if you don’t catch it in time!

Enter the DATABASE BACKUP AND RESTORE, the subject of this article. Database backups have been around as long as Relational Database Management Systems (RDBMS). In my humble opinion, a product cannot be considered a full-featured RDBMS unless it can perform routine backups and allows for granular restore to a point in time. (Sorry, but Microsoft Excel and Access simply do not qualify.) Being a Microsoft guy, I’m going to zero in on their flagship product, SQL Server, but Oracle, SAP, IBM and many others have similar functionality. (See the Gartner Magic Quadrant for database systems for a quick look at the various vendors, including Microsoft, a clear leader in this Magic Quadrant.)

So what is a BACKUP? “Is it not simply a copy of the database?” you say, “I can make file copies of my Excel spreadsheet. Isn’t that the same as a backup?” Let me explain how database backups work and then you can decide the answer to that question.

First of all, you’ll need the system to create a FULL database backup. This is a file generated by the database server, stored on the file system, in a format proprietary to the product. Typically, full backups are taken once per night for a moderately sized database (say, under 100 GB) and should be handled by an automated scheduling service such as SQL Server Agent.
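In SQL Server this boils down to a single BACKUP DATABASE command; a minimal example, with an illustrative database name and file path:

    BACKUP DATABASE SalesDB
    TO DISK = N'D:\Backups\SalesDB_Full.bak'
    WITH INIT, CHECKSUM;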

Next, you’ll need TRANSACTION LOG backups. Log backups, as they are known, record every change in the database that has occurred since the last full or log backup. A good starting point is scheduling log backups at least every hour, tightening down to every few minutes if the database is extremely active.
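A log backup uses a similar command (this sketch assumes the database is in the FULL recovery model, which log backups require; names and paths are again illustrative):

    BACKUP LOG SalesDB
    TO DISK = N'D:\Backups\SalesDB_Log_2330.trn'
    WITH INIT, CHECKSUM;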

Now, to restore a database in the event of a failure, you need to do one very important step: back up the transaction log one last time if you want to have any hope of restoring to a recent point. To perform the actual restore, you’ll need what is known as the ‘chain of backups’, which includes the most recent full backup and every subsequent log backup. During the restore, you will be able to specify a point in time anywhere from the time of the full backup to the time of the latest log backup, right down to the second or millisecond.
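A sketch of that sequence in T-SQL, with illustrative names, paths and timestamps:

    -- One last (tail-of-the-log) backup; NORECOVERY leaves the database in a restoring state
    BACKUP LOG SalesDB
    TO DISK = N'D:\Backups\SalesDB_TailLog.trn'
    WITH NORECOVERY;

    -- Restore the chain: the most recent full backup first...
    RESTORE DATABASE SalesDB
    FROM DISK = N'D:\Backups\SalesDB_Full.bak'
    WITH NORECOVERY;

    -- ...then every subsequent log backup in order...
    RESTORE LOG SalesDB
    FROM DISK = N'D:\Backups\SalesDB_Log_2330.trn'
    WITH NORECOVERY;

    -- ...stopping at the desired point in time with the final (tail) log backup
    RESTORE LOG SalesDB
    FROM DISK = N'D:\Backups\SalesDB_TailLog.trn'
    WITH STOPAT = N'2023-06-01T14:32:00', RECOVERY;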

So we’re all set, right? Almost. The mantra of Database Administrators the world over regarding backups is this: “The backups are only as good and sure as the last time we tested the RESTORE capability.” In other words, if you haven’t tested your ability to restore your database to a particular point in time, you can’t be sure you’re doing it right.

Case in point: I saw a backup strategy once where the FULL backups were written directly to a tape drive every night; then, first thing in the morning, the IT guys would dutifully eject the tapes and immediately ship them out to an off-site storage location. How can you restore a database if your backups are not available?

Case two: The IT guys, not understanding SQL backup functionality and benefits, used a third-party tool to take database backups, but didn’t bother with the logs. After four years of this, they had a log that was 15 times the size of the database! So big, in fact, that there was no space available to hold its backup. About a year after I got the situation straightened out, with regular full AND transaction log backups going, the physical server (virtualization was not common practice then) experienced a debilitating hardware failure and the whole system was down for three days. Once running again, the system (a financials software package with over 20,000 tables!) was restored to a point in time right before the failure. Having the daily FULL backups saved the financials system (and the company). But having the log backups as well saved many people the day’s work they would have lost if we had had to go back to the latest FULL backup.

So, what’s your data worth? If your data is critical to your business, it is critical that you properly back up the data. Talk to us to learn how we can help with this.



Electrify Your Business with Data

Data is a lot like electricity – a little bit at a time, not too much direct contact, and you’re fine. For example, a single nine-volt battery doesn’t provide enough power to light a single residential bulb. In fact, it would take about a dozen nine-volt batteries to light that single bulb, and it would only last about an hour. It’s only when you get massive amounts of electricity flowing in a controlled way that you can see real results, like running electric motors, lighting up highway billboards, heating an oven or powering commuter trains.

It’s the same with data. The sale of a single blue shirt at a single outlet store is not much data. And it’s still not much even when you combine it with all the sales for that store in a single day. But what about a year’s worth of data, from multiple locations? Massive amounts of data can do amazing things for us as well. We have all seen in today’s data-centric business environment what controlled usage of data can do.

Some examples include:

  • The National Oceanic and Atmospheric Administration (NOAA) can now more accurately predict a hurricane’s path thanks to data that has been collected over time
  • Marketing firms can save money on culled down distribution lists based on customer demographics, shopping habits and preferences.
  • Medical experts can identify and treat conditions and diseases much better based on a patient’s history, life risks and other factors.
  • Big ‘multi-plex’ movie houses can more accurately predict the number of theatres they will need to provision for the latest summer blockbuster by analyzing Twitter and other social media feeds related to the movie.

All of this can be done thanks to controlled data analytics.

The key word here is “controlled.” With a background in marine engineering and shore-side power generation, I have seen my share of what can happen when electricity and other sources of energy are not kept ‘controlled.’ Ever see what happens when a handful of welding rods go through a steam turbine spinning at 36,000 RPM and designed for tolerances of thousandths of an inch? It’s not pretty. After as many years in database technologies, data analysis and visualizations, I have also seen the damage resulting from large quantities of uncontrolled data. In his book Signal: Discerning What Matters Most in a World of Noise, author Stephen Few shows a somewhat tongue-in-cheek graph that ‘proves’ that the sale of ice cream is a direct cause of violent crime. Or was it the other way around? It’s an obvious comic hyperbole that serves to illustrate his point that we need to be careful about how we analyze and correlate data.

With the ‘big data’ explosion, proponents will tell you that ‘if a little is good, then more is better.’ It’s an obvious extension, but is it accurate? Is there such a thing as ‘too much data’?

Let’s say you are a clothing retail store in the mall. Having data for all of your sales over the past ten years, broken down by item, store, date, salesperson and any number of other dimensions, may be essential. What if we were to also include ALL the sales of ALL competitors’ products, seasonal weather history, demographic changes, foot traffic patterns in the mall and just about anything else that could influence a customer’s decision to buy your product, even down to what they had for lunch just before they made the purchase? The result would most likely be UN-controlled data analysis. This tends to lead to erroneous correlations and bad decisions. For instance, you just might discover that customers are four times as likely to make a purchase if they had pizza for lunch, never realizing that there are more pizza restaurants near your stores than any other type of food service!

When it comes to data, stick with what you know is good data. Make sure it’s clean, reliable and, above all, relevant. Most of all, make sure you CONTROL your data. Without control, there may be no stopping the damage that results.



5 Critical Success Factors for Initiating Master Data Management

If you are an organization looking to improve data quality and business operations through the development of Master Data Management (MDM), then confusion about how to implement it can have drastic consequences.  While business leaders see and live with the challenges, they are likely unclear about the details, and turn to technology leadership for guidance.  Technology leaders are more apt to see immediately how MDM transforms an organization from one with costly Information Gaps to one which operates more seamlessly through its ability to create Data Enabled Value.

But if you have never implemented MDM, your preconceived notions about data quality can become a critical obstacle to realizing the benefits on a timeline which the business would accept.  Alternatively, perhaps you simply don’t know where to start.  Phase 1 is the right time to introduce your organization to fundamental MDM concepts, setting expectations for what the ownership framework for master data will look like.  To ensure a successful introduction of MDM and Data Quality practices to your organization, follow these steps.

Don’t Create Another Information Silo – the goal of MDM is to break down barriers to clean, high-value data assets.  But many technologists are tempted to see the MDM system as “just another database”, which can be designed like any other custom database in the organization.  MDM is not just another database.  It has its own set of rules and best practices for how to design a simple, clean data model which can be managed by the business.  Finally, it is not necessary to spend big bucks on an MDM database – Microsoft’s solution comes bundled with two editions of SQL Server.  Really, don’t build your own.

Don’t Confuse CRM with MDM – or any other business system you currently use, for that matter!  Many CRM software companies (and some ERP vendors) like to promote MDM capabilities in their software.  But another goal of MDM is to improve the overall quality of the Master Data and create a space for the Data Governance group to manage that data quality.  If you do not extract master data from its source, then the governance group must manage the data in a process-oriented system.  This is akin to tying one arm behind their back.  In the end, governance becomes hindered by the process requirements of a source system.  Instead, create a “data jurisdiction” (which is what MDM is), extract the data from sources into that jurisdiction and govern inside that jurisdiction.  This brings me to my next point.

Flip the Script – a common frame of reference for data quality is “garbage in, garbage out”.  This frame of reference helps technology leaders explain to the business why data assets have historically been ill-suited for future use: the data entered the system in poor form, and now exits the system in poor form.  The conclusion many draw from this is that in order to improve the data quality, one must enforce better data quality rules at the outset, or “scrub” the data in transit from the source.  Wrong Answer!  Flip the Script: bring the data “as-is” from the sources.  Permit your data stewards and governors to see the data as it exists in the enterprise.  This will lead to a deeper understanding of the very real process “frictions” in play.  Then, use an MDM toolset to enrich, match and harmonize the data in the MDM solution itself.  By switching the frame of reference, you can accelerate the project plan and get the solution and data into the hands of stakeholders who are empowered to take action.

Formulate the Data Governance Program – Standing up an MDM system is a project, and should be managed as such.  Data Governance is an ongoing program, and when the project is successfully concluded, the organization must now take ownership of the “Data as an Asset”.  Master Data, by its very nature, implies Shared Ownership, and each of the data’s stakeholders is accustomed to managing their own piece of the larger whole.  Differences will inevitably arise.  If you don’t know how to get started, work with an experienced MDM delivery team or borrow an existing framework.

Save Operational MDM for Phase 2 – Operational MDM refers to a solution’s capability to distribute and synchronize high-quality data back to the sources of that data.  Data cleanup of a source system, such as an ERP system, is a common goal, one which is widely articulated by business and technology leaders.  I see the virtue here, and indeed have completed very successful projects where master data flows back to operational systems to drive even more value.  However, these types of projects are lengthy, because a Data Bus Architecture must be implemented alongside MDM to route and transmit data back to systems reliably.  Further, they tend to neglect a crucial aspect of MDM – all of the Unsanctioned Master Data which resides in Spreadsheets, Databases or in people’s heads.  There is a ton of value created simply by establishing the MDM solution as an authoritative source of high-quality, high-value data assets.  Don’t make ERP cleanup part of your initial goal.  But do create a data model which will make it easy to do so in a later phase.  MDM projects which defer this requirement show value to the business very quickly, and get funding for later projects more easily.

Master Data Management is a powerful remedy for a number of broad Information Gaps in an organization.  The challenge of implementation is understanding what the goals are, managing expectations, and building a Data Governance and Stewardship culture.  Make sure you understand what can be accomplished quickly with MDM, and focus on those goals in the early stages.