Archive for SQL 2012

T-SQL Inner Join Need Not Always Be an Equal Join

Have you ever had to write a T-SQL query that joined a table to itself? The classic example is an employee table that lists the ID of the person’s supervisor or boss, which refers to another row in the same table. This example is nothing like that (Hint: If you do have to do that, do a search for “recursive hierarchy”). Our example today is something different.

In our example we’re going off of the following scenario: given a list of cities on a map, generate a table that can record the distance between any two. In other words, create a list where every value in the table is paired with every other value. For this exercise, we’re not going to be calculating the actual distance, simply creating the records and leaving the space for someone else to fill in. Sounds simple enough. Let’s start with the simplest example: TWO points: New York and San Francisco:

We could easily create a simple three column table with the following entries:
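To make this concrete, here is a minimal sketch of such a table in T-SQL. The table and column names are my own placeholders, not anything prescribed:

    -- A simple three-column Distances table (names are placeholders)
    CREATE TABLE dbo.Distances (
        City1    VARCHAR(50) NOT NULL,
        City2    VARCHAR(50) NOT NULL,
        Distance INT NULL   -- left blank for someone else to fill in
    );

    -- Both directions recorded, which is one row too many
    INSERT INTO dbo.Distances (City1, City2)
    VALUES ('New York', 'San Francisco'),
           ('San Francisco', 'New York');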

But the distance in one direction is the same as the return trip so really, we only need one row. After deleting the redundant record and adding Chicago to our list of cities, we get this:


As you can see, with the addition of just one value to our list of cities, we need two additional rows in the Distances table. As you might expect, the fourth city in the first table would require three more rows in the second. The fifth city would require four more and so on. The growth is quadratic: at 100 cities we would need 4,950 pairs of cities in the distance table, and at 1,000 we would need close to half a million. The actual number is

N * ( N – 1 ) / 2

Time to automate, which is the focus of this article.

Let’s settle on a total of five cities, with the last two being New Orleans and Las Vegas. We could easily build the grid below and fill in any of the appropriate blocks. The black squares are same-city pairs (New York to New York, or Chicago to Chicago, etc.) so we don’t need to worry about them. And if we can get the distance from New York to San Francisco, for example, we don’t need the distance from San Francisco back to New York, so we can also eliminate all the squares marked in gray:

Let’s replace our city names with numbers. After all, any decent database design will have a unique key assigned to each row of the City table.
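If you are following along, a hypothetical City table might look like this (the names and keys are assumptions for illustration):

    -- City table with a surrogate key (hypothetical)
    CREATE TABLE dbo.City (
        CityID   INT IDENTITY(1,1) PRIMARY KEY,
        CityName VARCHAR(50) NOT NULL
    );

    INSERT INTO dbo.City (CityName)
    VALUES ('New York'), ('San Francisco'), ('Chicago'), ('New Orleans'), ('Las Vegas');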

And our grid now looks like this:


So how do we create this mega list of relationships from ONE table? An INNER JOIN of the City table to itself on the City ID column will generate the squares marked in black. But that is exactly what we DON’T want. What about a CROSS JOIN?
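Sketched against the hypothetical City table above, the CROSS JOIN looks like this:

    -- Every city paired with every city, itself included: N * N rows
    SELECT c1.CityID AS [City 1], c2.CityID AS [City 2]
    FROM dbo.City AS c1
    CROSS JOIN dbo.City AS c2;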

That will do it, but with over twice the results we need. Remember that once we know the distance from, say, Chicago to Las Vegas, we don’t also need to record the distance from Las Vegas to Chicago. And we’re still left with some extra rows where the cities are the same, like the distance from New Orleans to New Orleans.

Comparing the City ID columns in the result set above with the City ID axis of the grid, we can see that we’re only looking for pairs (cells) where the ID of City 2 is greater than the ID of City 1. From here, we have two choices: Stick with the CROSS JOIN and add a WHERE clause, or go back to the INNER JOIN and change the ON clause:
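Both options, again using the hypothetical City table:

    -- Option 1: CROSS JOIN with a WHERE clause
    SELECT c1.CityID AS [City 1], c2.CityID AS [City 2]
    FROM dbo.City AS c1
    CROSS JOIN dbo.City AS c2
    WHERE c2.CityID > c1.CityID;

    -- Option 2: INNER JOIN with a greater-than ON clause
    SELECT c1.CityID AS [City 1], c2.CityID AS [City 2]
    FROM dbo.City AS c1
    INNER JOIN dbo.City AS c2
        ON c2.CityID > c1.CityID;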

The results are the same. In fact, the estimated execution plan from the query optimizer is the same as well.

The point of this whole exercise is this: An INNER JOIN need not always be an EQUAL JOIN, though that is by far the most common occurrence. When faced with unique query needs, think outside the box. In this case, simply substituting a GREATER THAN for the EQUAL in the ON clause did the trick.

Learn more about how BlumShapiro Consulting can help with a project like this and additional application development.

About Todd: Todd Chittenden started his programming and reporting career with industrial maintenance applications in the late 1990s. When SQL Server 2005 was introduced, he quickly became certified in Microsoft’s latest RDBMS technology and has added certifications over the years. He currently holds an MCSE in Business Intelligence. He has applied his knowledge of relational databases, data warehouses, business intelligence and analytics to a variety of projects for BlumShapiro since 2011.

How Much is Your Data Worth?

Data is the new currency in today’s modern businesses. From the largest international conglomerate down to the smallest neighborhood Mom-and-Pop shop, data is EVERYTHING! Without data, you don’t know who to bill for services, or for how much. You don’t know how much inventory you need on hand, or who to buy it from if you run out. Seriously, if you lost all of your data, or even a small but vitally important piece of it, could your company recover? I’m guessing not.

“But,” you say, “We have a disaster recovery site we can switch to!”

If your racks melt down into a pool of heavy metals on the server room floor, then yes, by all means switch over to your disaster recovery site, because molten discs certainly qualify as a “disaster!” Databases hosted on private or public cloud virtual machines are less susceptible, but not immune, to hardware failures. But what about a failure of a lesser nature? What if one of your production databases gets corrupted by a SQL injection attack, cleaned out by a disgruntled employee, or accidentally purged because a developer thought he was working against the DEV environment? Inadvertent changes to data are no respecter of where such data is stored, or how it is stored! And, sorry to say, clustering or other HADR solutions (High Availability/Disaster Recovery, such as SQL Server Always On technology) may not be able to save you in some cases. Suppose some data gets deleted or modified in error. These ‘changes’, be they accidental or intentional, may get replicated to the inactive node of the cluster before the issue is discovered. After all, the database system doesn’t know whether it should stop such changes from happening when the command to modify data is issued. How can it tell an ‘accidental purge’ from regular record maintenance? So the system replicates those changes to the failover node, and you end up with TWO copies of an incorrect database instead of one good one and one bad! Worse yet, depending on your replication latency from the primary site to the disaster recovery site, and how quickly you stop the DR site from replicating, THAT copy may get hosed too if you don’t catch it in time!

Enter the DATABASE BACKUP AND RESTORE, the subject of this article. Database backups have been around as long as Relational Database Management Systems (RDBMS). In my humble opinion, a product cannot be considered a full-featured RDBMS unless it can perform routine backups and allows for granular restores to a point in time. (Sorry, but Microsoft Excel and Access simply do not qualify.) Being a Microsoft guy, I’m going to zero in on their flagship product, SQL Server, but Oracle, SAP, IBM and many others offer similar functionality. (See the Gartner Magic Quadrant for database systems for a quick look at the various vendors, including Microsoft, a clear leader in that quadrant.)

So what is a BACKUP? “Is it not simply a copy of the database?” you say, “I can make file copies of my Excel spreadsheet. Isn’t that the same as a backup?” Let me explain how database backups work and then you can decide the answer to that question.

First of all, you’ll need the system to create a FULL database backup. This is a file generated by the database server, stored on the file system, in a format proprietary to the product. Typically, full backups are taken once per night for a moderately sized database (say, under 100 GB), and should be handled by an automated scheduling service such as SQL Agent.

Next, you’ll need TRANSACTION LOG backups. Log backups, as they are known, record every single change in the database that has occurred since the last full or log backup. A good starting point is scheduling log backups at least every hour, tightening that down to every few minutes if the database is extremely active.
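In T-SQL, the two backup types boil down to statements along these lines; the database name and file paths are placeholders for whatever your environment uses:

    -- Nightly FULL backup (typically scheduled via SQL Agent)
    BACKUP DATABASE [SalesDB]
        TO DISK = N'D:\Backups\SalesDB_Full.bak'
        WITH INIT, COMPRESSION;

    -- Hourly (or more frequent) transaction log backup
    BACKUP LOG [SalesDB]
        TO DISK = N'D:\Backups\SalesDB_Log.trn'
        WITH INIT, COMPRESSION;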

Now, to restore a database in the event of a failure, there is one very important first step: back up the transaction log one last time if you want any hope of restoring to a recent point. To perform the actual restore, you’ll need what is known as the ‘chain of backups’: the most recent full backup and every subsequent log backup. During the restore, you can specify a point in time anywhere from the time of the full backup to the time of the latest log backup, right down to the second or millisecond.
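A sketch of that restore sequence, again with placeholder names and an arbitrary point in time:

    -- 1. Tail-log backup: capture everything since the last log backup
    BACKUP LOG [SalesDB]
        TO DISK = N'D:\Backups\SalesDB_Tail.trn'
        WITH NORECOVERY;

    -- 2. Restore the most recent FULL backup, leaving the database in a restoring state
    RESTORE DATABASE [SalesDB]
        FROM DISK = N'D:\Backups\SalesDB_Full.bak'
        WITH NORECOVERY, REPLACE;

    -- 3. Restore each subsequent log backup in order, stopping at the desired moment
    RESTORE LOG [SalesDB]
        FROM DISK = N'D:\Backups\SalesDB_Log.trn'
        WITH STOPAT = N'2016-05-01 14:35:00', RECOVERY;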

So we’re all set, right? Almost. The mantra of Database Administrators the world over regarding backups is this: “The backups are only as good and sure as the last time we tested the RESTORE capability.” In other words, if you haven’t tested your ability to restore your database to a particular point in time, you can’t be sure you’re doing it right. Case in point: I saw a backup strategy once where the FULL backups were written directly to a tape drive every night, then first thing in the morning the IT guys would dutifully eject the tapes and immediately ship them out to an off-site storage location. How can you restore a database if your backups are not available? Case two: The IT guys, not understanding SQL backup functionality and benefits, used a third-party tool to take database backups, but didn’t bother with the logs. After four years of this, they had a log that was 15 times the size of the database! So big, in fact, that there was no space available to hold its backup. About a year after I got the situation straightened out, with regular full AND transaction log backups running, the physical server (virtualization was not common practice then) experienced a debilitating hardware failure and the whole system was down for three days. Once running again, the system (a financials software package with over 20,000 tables!) was restored to a point in time right before the failure. Having the daily FULL backups saved the financials system (and the company), but having the log backups as well saved many people a full day’s worth of work that would have been lost had we been forced to go back to the latest FULL backup.

So, what’s your data worth? If your data is critical to your business, it is critical that you properly back up the data. Talk to us to learn how we can help with this.



Adding User Configurations to an Analysis Server Cube

Part 3: One Row for Each Dimension Member

In Part 2, we created a view that pivoted the Configurations table around a single [Dummy Key] value of -1 and joined that to a dimension. And the dimension we picked had to be one that would never be used with the Configuration Measure Group because the sum of the configuration measure was only available at the UNKNOWN dimension member. Now, perhaps we come to realize that there is no good candidate dimension to which we can join our configuration measure group because it needs to be available to ALL dimensions. What now?

To make this work, we will need to edit our view from Part 2. First, select a dimension with only a few members. I chose my Date dimension for this solution. We’ll need to create the dimension table and populate it with some sample data.

Create the table …
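Something like this would do; the table and column names here are assumptions, not requirements:

    -- A hypothetical Date dimension table with a handful of members
    CREATE TABLE dbo.DimDate (
        [Date Key]  INT NOT NULL PRIMARY KEY,
        [Full Date] DATE NOT NULL
    );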

Insert some data …
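For this walkthrough, seven sample members (one week) are plenty:

    INSERT INTO dbo.DimDate ([Date Key], [Full Date])
    VALUES (20160101, '2016-01-01'),
           (20160102, '2016-01-02'),
           (20160103, '2016-01-03'),
           (20160104, '2016-01-04'),
           (20160105, '2016-01-05'),
           (20160106, '2016-01-06'),
           (20160107, '2016-01-07');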

Now alter the view by adding a CROSS JOIN to the dimension table, and substituting the dimension’s [Date Key] for the static [Dummy Key] as follows:
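A sketch of the altered view, carrying over the hypothetical view, table and configuration names used in the Part 2 sketch:

    ALTER VIEW dbo.vwConfigurations
    AS
    SELECT *
    FROM (
        SELECT d.[Date Key],
               c.ConfigurationName,
               c.ConfigurationValue
        FROM dbo.Configuration AS c
        CROSS JOIN dbo.DimDate AS d
    ) AS src
    PIVOT (
        MAX(ConfigurationValue)
        FOR ConfigurationName IN ([Budget Threshold], [Sales Target])
    ) AS pvt;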

The CROSS JOIN in the sub-select will return the Cartesian product of all rows in the Date dimension table (7) and all rows in the Configuration table (2), or 14 rows. Each Configuration will have the exact same value for all dates. The PIVOT statement will bring those 14 rows back down to 7 rows with three columns: [Date Key] and the two configuration columns.

Refreshing the Data Source View will remove the [Dummy Key] field and add the new [Date Key] column in its place. Next we need to edit the Dimension Usage tab so that the Configuration Measure Group is joined to the Date dimension on the [Date Key] field. But browsing the cube now along any other dimension will yield surprising results: the Configuration value is SUMMED across the number of days in the slice! Well, obviously, since that is what we told it to do when we created the Measures back in Part 2. Go back to the Cube Structure tab and change the aggregation of each Measure in the Configurations Measure Group to “First Non Empty” (if you have Enterprise Edition) or Min or Max (if you have only Standard Edition). Now, regardless of the number of rows in the cube slice, the aggregation will always yield the desired value.

As with Part 2, we still have a single point outside the cube where configurations are stored, but the configurations are available to all dimensions.

Adding additional Configuration entries would involve the following steps:

  • Add a row to the Configuration table
  • Edit the view to include the field based on its [Configuration Name]
  • Refresh the cube project’s Data Source view to get the new field included
  • Add a new Measure to the Configuration measure group.

So far, our solutions have all resulted in global configuration values that, if changed, change for all dimension slices, in our case for all dates in the Date dimension. In the next part, we’ll explore a few different methods that allow for entering configuration values that are date dependent. Since I started using the Date dimension for this series, I’m going to stick with it for clarity, and also because that is usually how configurations are needed by the business.

Adding User Configurations to an Analysis Server Cube Part 2

Part 2: Dynamic User Configurations

In Part 1 of this series, we hard-coded some MDX values into the cube. That approach works in that it produces the desired end result, but if the values need to change, a developer is needed to make it happen. What is needed is a way to persist the configuration values outside of the cube itself. In Part 2, we will create a configuration table to store the values. The structure is borrowed from that used by earlier versions of Integration Services:
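A sketch of that table; only the Name and Value columns are essential, and the other two column names are my own assumptions:

    CREATE TABLE dbo.Configuration (
        ConfigurationName  VARCHAR(100) NOT NULL PRIMARY KEY,
        ConfigurationValue DECIMAL(18, 4) NOT NULL,
        [Description]      VARCHAR(255) NULL,   -- administration/clarification
        LastModified       DATETIME NULL        -- administration/clarification
    );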

Really, the only two fields absolutely required here are the Name and Value fields; the other two are added for administration and clarification. Next we’ll insert some fictional values into this configuration table:
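For example, two made-up configurations (the names and values are purely illustrative):

    INSERT INTO dbo.Configuration (ConfigurationName, ConfigurationValue, [Description])
    VALUES ('Budget Threshold', 150000, 'Corporate budget cutoff'),
           ('Sales Target',     200000, 'Company-wide sales goal');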

Next, create a view that pivots the Configuration Name and Configuration Value fields with T-SQL code like this:
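A sketch of such a view, using the hypothetical names from above:

    CREATE VIEW dbo.vwConfigurations
    AS
    SELECT *
    FROM (
        SELECT -1 AS [Dummy Key],
               ConfigurationName,
               ConfigurationValue
        FROM dbo.Configuration
    ) AS src
    PIVOT (
        MAX(ConfigurationValue)
        FOR ConfigurationName IN ([Budget Threshold], [Sales Target])
    ) AS pvt;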


 

The dataset returned by this view will be a single row of data with one column for each Configuration named in the PIVOT section, and a static [Dummy Key] value of -1.

 

Again, some purists may dislike my use of “SELECT * FROM …” in my view definition, but since I am limiting the columns returned via the “… FOR ConfigurationName IN (…) …” portion of the PIVOT clause, there is not much chance of getting unneeded columns.

 

Next, add this view to the cube project’s Data Source View, then add it as a Measure Group to your cube. Delete the COUNT and the SUM(Dummy Key) measures that were added by the Measure Group wizard. Since there is only one row in the measure group’s base table, a SUM() aggregation for each configuration is fine. Lastly, since a Measure Group MUST be joined to at least one Dimension, on the Dimension Usage tab join the Configuration Measure Group to a dimension in your cube that meets the following criteria:

  1. The dimension has a member row with a key value of -1. (Data Warehouse designers typically add a -1 key as the “Unknown” member of the dimension table.)
  2. You will NOT be using the dimension in conjunction with the Configuration Values. This sounds rather counterintuitive based on cube design practices, but it is explained below.

Browsing the cube by any dimension OTHER than the one used to join the Configuration Measure Group will return the configuration measure values at every cube intersection. This is because you are actually selecting the [All] member of that one dimension, which includes the SUM of each Configuration Value. And since there is only one row at the [Unknown] member (Key = -1), the SUM at the [All] level is the value of that one row. Browsing the cube INCLUDING the one dimension will show that the configuration values are ONLY available for the “Unknown” member, and not for any others. If your configuration values, whatever they represent, will NEVER be used with the dimension you have them joined to, then this is just fine. But if there is any possibility that the Configuration Measures would be needed for any and every dimension in the cube, then you need to do a little editing of the view. We’ll cover that in Part 3.

The advantage of this method over what was covered in Part 1 is that if the Configuration Values ever need to be changed, it is now simply a matter of changing a single value in a table and reprocessing the Measure Group instead of editing the cube design and redeploying the entire cube. To add additional configurations would involve the following:

  1. Add the entry in the table
  2. Edit the view to include the appropriate [ConfigurationName] in the PIVOT clause
  3. Refresh the Data Source View for the cube project
  4. Add a new measure to the Configuration Measure Group for the newly added column
  5. Deploy and process the cube

In Part 3, we will overcome the limitation of NOT being able to use the Configuration Measures for EVERY dimension.