Saturday, April 11, 2015

SSIS: An easy SCD optimization for dev and prod

The value of reading this post, depends on how you work with SSIS and how database nursing are handled in within your organization.

The optimization is a single index, but if you only nurse indexes in prod, you could waste a great time when developing SCDs in SSIS. The method is simple, when you now the nature of your SCD, then you can create an index right away, and reduce your development waiting time. Specially if you are testing with bigger volumes of data.

Let me show you

Let's say you have following table definitions, and you working in a SSIS project using Visual Studio:

-- Staging
CREATE TABLE Staging.Customers
(
CustomerId UNIQUEIDENTIFIER,
FistName NVARCHAR(200),
MiddleInitials NVARCHAR(200),
LastName NVARCHAR(200),
AccountId INT,
CreationDate DATETIME2
)
GO 

-- Dimension
CREATE TABLE dbo.dimCustomers
(
CustomerDwhKey INT IDENTITY(1,1),
[Current] BIT, 
CustomerId UNIQUEIDENTIFIER,
FistName NVARCHAR(200),
MiddleInitials NVARCHAR(200),
LastName NVARCHAR(200),
AccountId INT,
CreationDate DATETIME2
CONSTRAINT PK_CustomerId PRIMARY KEY CLUSTERED (CustomerDwhKey)
)
GO


You have a Data Flow, where you transfer data from Staging.Customers to the dimension dbo.dimCustomers using the built-in component Slowly Changing Dimension:


In our example setup CustomerId will be a so-called Business key, And Current will be the indicator for which row are current. It should also be noted, it is possible to have more than one Business keys.


We'll configure attributes as:


Now, the Slowly Changing Dimension component work in following way:

For each entity it recieves, it will search the dimension table for an entity with the same business keys(in plural!!!) and is flagged as current, or in plain SQL:

SELECT  attribute[, attribute] FROM dimension_table WHERE current_flag = true AND business_key = input_business_key[, business_key = input_business_key]

Or as it will look like in our example

SELECT AccountId, CreationDate, FirstName, MiddleInititals, LastName FROM dbo.dimCustomers WHERE [Current] = 1 AND CustomerId = some_key

Further, in case we have an historical change, the current entity in the dimension must be expired by setting Current = 0.

UPDATE dbo.dimCustomers SET [Current] = 0 WHERE [Current] = 1 AND CustomerId = some_key 

The solution

As you might have realize by now, we can improve performance tremendously by putting an index on the current flag and the business keys(again plural!!!). For each entity passing through the Slowly Changing Dimension component, the will be at least 1, but likely 2 searches in the dimension table. And the by knowing the business keys and current flag. and the nature of the Slowly Changing Dimension component, you can predict the index which will improve performance.

The index for our sample will be

CREATE NONCLUSTERED INDEX IX_Current_CustomerId ON dbo.dimCustomers
(
[Current],
CustomerId --- Remember to include each business key
)

Should index be a filtered? I'll let that be up to you.

Indexes, bulk and loading of dimensions

Some tend to drop indexes when loading dimenson, with the argument: Bulk loading is fastest without indexes, which SQL Server has to maintain while loading. This argument has to be revised when working with the Slowly Changing Dimension component.

Because the component searches the dimensions so heavily, (in general) it will be faster loading with indexes than without. If there is no indexes, each entity going through the component, will require at least one table scan, which is quite expensive, and gets more expensive as your dimension grows. 

That's all

No comments:

Post a Comment